[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160623T0000). Please do the needful.
[00:02:23] (PS3) Jdlrobson: Complete list of legacy main pages, switch default to false [mediawiki-config] - https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425)
[00:15:50] (CR) Alex Monk: "(Note: The Jenkins failure appears to be bogus)" [software] - https://gerrit.wikimedia.org/r/295598 (owner: Alex Monk)
[00:39:05] (PS1) Yurik: Configure Kartotherian geoshapes support [puppet] - https://gerrit.wikimedia.org/r/295602 (https://phabricator.wikimedia.org/T134084)
[00:41:13] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: puppet fail
[01:08:36] RECOVERY - puppet last run on elastic2011 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[01:53:55] Blocked-on-Operations, DBA, Labs, Labs-Infrastructure: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2400928 (AlexMonk-WMF) a:AlexMonk-WMF I'm having a go at this. > It blows up and rebuilds all wikis on every run. It truncates the meta_p.wiki...
[02:16:51] (PS1) Alex Monk: [WIP/POC/POS] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607
[02:17:20] (CR) jenkins-bot: [V: -1] [WIP/POC/POS] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (owner: Alex Monk)
[02:17:22] (PS2) Alex Monk: [WIP/POC/POS] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607
[02:17:48] (CR) jenkins-bot: [V: -1] [WIP/POC/POS] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (owner: Alex Monk)
[02:26:16] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.6) (duration: 11m 19s)
[02:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:41:15] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.7) (duration: 07m 05s)
[02:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:47:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jun 23 02:47:59 UTC 2016 (duration 6m 44s)
[02:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:56:15] (PS4) Smalyshev: Prepare scap3 deployment for WDQS [puppet] - https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144)
[03:11:49] (PS3) Alex Monk: [WIP/POC/POS] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607
[03:12:05] (CR) jenkins-bot: [V: -1] [WIP/POC/POS] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (owner: Alex Monk)
[03:13:15] (PS4) Alex Monk: [WIP/POC/POS] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607
[03:13:32] (CR) jenkins-bot: [V: -1] [WIP/POC/POS] Add python version of maintain-replicas script [software] -
https://gerrit.wikimedia.org/r/295607 (owner: Alex Monk)
[03:14:11] (PS5) Alex Monk: [WIP/POC/POS] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450)
[03:14:28] (CR) jenkins-bot: [V: -1] [WIP/POC/POS] Add python version of maintain-replicas script [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[03:17:55] (PS1) Alex Monk: Couple of tiny maintain-meta_p.py improvements [software] - https://gerrit.wikimedia.org/r/295608
[03:18:13] (CR) jenkins-bot: [V: -1] Couple of tiny maintain-meta_p.py improvements [software] - https://gerrit.wikimedia.org/r/295608 (owner: Alex Monk)
[03:52:04] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 730 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5196960 keys - replication_delay is 730
[03:56:34] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5165141 keys - replication_delay is 0
[04:06:12] Operations, MediaWiki-extensions-UniversalLanguageSelector, Wikimedia-SVG-rendering, I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2400964 (mehtab.ahmed) Author still needs a couple days.
[06:11:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[06:12:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:13:13] (PS1) KartikMistry: apertium-eo-es: Rebuild for Jessie, cleanup [debs/contenttranslation/apertium-eo-es] - https://gerrit.wikimedia.org/r/295611 (https://phabricator.wikimedia.org/T107306)
[06:15:24] Operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2400994 (KartikMistry)
[06:16:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:19:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:25:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:25:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[06:29:37] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:30:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:31:07] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:16] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:05] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:06] PROBLEM - puppet last run on mw1170 is CRITICAL:
CRITICAL: Puppet has 1 failures
[06:33:36] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:16] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:25] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:25] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:07] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:36] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:46] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:57:46] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:47] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:15] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:15] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:16] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:07] !log installing spice security updates
[06:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:09:16] Operations, DBA, Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2401018 (jcrespo) It is a bit more complex
than that- we need to failover the slave actions to the master (and use only the master). Then (for example, the following week) we need to...
[07:14:04] Operations, DBA, Phabricator: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2401019 (jcrespo)
[07:15:54] Operations, DBA, Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#1972522 (jcrespo) I have created T138460 specifically for Phabricator. Related to T137928#2389155, too.
[07:18:55] Blocked-on-Operations, Operations, DBA, Labs, and 2 others: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2401040 (jcrespo) labsdb1002 will never get fixed.
[07:23:08] (PS1) KartikMistry: apertium-es-ast: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-es-ast] - https://gerrit.wikimedia.org/r/295624 (https://phabricator.wikimedia.org/T107306)
[07:24:31] Operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2401055 (KartikMistry)
[07:25:28] Krenair: o.O what timezone are you in right now?
[07:25:50] Krenair: probably zuul-merger has a corrupt copy of the operations/software repo
[07:35:08] (PS1) KartikMistry: apertium-es-gl: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-es-gl] - https://gerrit.wikimedia.org/r/295625 (https://phabricator.wikimedia.org/T107306)
[07:35:18] Krenair eschews timezones
[07:35:50] Operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2401071 (KartikMistry)
[07:46:32] (CR) Hashar: "recheck" [software] - https://gerrit.wikimedia.org/r/295608 (owner: Alex Monk)
[07:49:53] (CR) Hashar: "recheck" [software] - https://gerrit.wikimedia.org/r/295608 (owner: Alex Monk)
[07:51:03] (CR) Hashar: "recheck" [software] - https://gerrit.wikimedia.org/r/295598 (owner: Alex Monk)
[07:51:21] (CR) Hashar: "recheck" [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[07:51:24] (CR) Hashar: "recheck" [software] - https://gerrit.wikimedia.org/r/295564 (https://phabricator.wikimedia.org/T135029) (owner: Ori.livneh)
[07:51:46] (PS3) ArielGlenn: add job that dumps history of flow pages [dumps] - https://gerrit.wikimedia.org/r/295587 (https://phabricator.wikimedia.org/T89398)
[07:55:14] (CR) Legoktm: "@Ariel: this is ready to be merged now!
Everything else is in place, so I'll be turning on the extension for usage next week after Wikiman" [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: ArielGlenn)
[07:56:41] ACKNOWLEDGEMENT - puppet last run on ms-be2023 is CRITICAL: CRITICAL: Puppet has 12 failures Filippo Giunchedi https://gerrit.wikimedia.org/r/295492
[07:56:41] ACKNOWLEDGEMENT - puppet last run on ms-be2024 is CRITICAL: CRITICAL: Puppet has 12 failures Filippo Giunchedi https://gerrit.wikimedia.org/r/295492
[07:56:41] ACKNOWLEDGEMENT - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Puppet has 13 failures Filippo Giunchedi https://gerrit.wikimedia.org/r/295492
[07:56:41] ACKNOWLEDGEMENT - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Puppet has 12 failures Filippo Giunchedi https://gerrit.wikimedia.org/r/295492
[07:56:41] ACKNOWLEDGEMENT - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Puppet has 12 failures Filippo Giunchedi https://gerrit.wikimedia.org/r/295492
[07:57:32] good morning
[08:00:30] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[08:03:01] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5159071 keys - replication_delay is 0
[08:06:50] !log change-prop deploying 45db4f84827
[08:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:09:51] Blocked-on-Operations, Operations, DBA, Labs, and 2 others: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2401109 (Jdforrester-WMF) stalled>Resolved a:Jdforrester-WMF In that case, I'm declaring this fixed.
[08:11:29] (PS1) Jcrespo: [WIP] Delete deprecated modules coredb_mysql and mysql_wmf [puppet] - https://gerrit.wikimedia.org/r/295628
[08:15:24] (CR) Gehel: [C: 1] "restbase1001 still failing for the same unrelated reason. Otherwise lgtm." [puppet] - https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422) (owner: Nicko)
[08:18:39] Operations, Wikimedia-SVG-rendering, Patch-For-Review: Install Amiri font (arabic) for svg - https://phabricator.wikimedia.org/T135347#2401125 (MoritzMuehlenhoff) @Uwe_a : When testing this I noticed that the Amiri font is in fact already installed on the image scalers: It was installed indirectly as...
[08:20:11] (PS10) Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785)
[08:22:11] Blocked-on-Operations, DBA, Labs, Labs-Infrastructure, Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2400728 (jcrespo) I have to add a view to a newly created labs-only table, so it is created for new wikis, too: ``` MariaDB L...
[08:23:25] Blocked-on-Operations, DBA, Labs, Labs-Infrastructure, Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2401137 (jcrespo)
[08:25:23] (CR) Filippo Giunchedi: prometheus: add nginx reverse proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: Filippo Giunchedi)
[08:26:28] (PS2) Gehel: Moving elasticsearch masters to new servers [puppet] - https://gerrit.wikimedia.org/r/295585 (https://phabricator.wikimedia.org/T138329)
[08:27:07] Blocked-on-Operations, Operations, Wikidata, Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258408 (Jdforrester-WMF) This is now done, right?
[08:27:54] mobrovac I'm happy to merge https://gerrit.wikimedia.org/r/#/c/295576/5 now if you or someone from services is around
[08:28:06] yuvipanda: cool, thnx
[08:28:23] yuvipanda: yes, we're all on the bench in the park waiting for the other rooms to open :)
[08:28:42] yuvipanda: so you can go ahead and merge it
[08:29:06] (PS2) Muehlenhoff: Add Amiri font to the scalers [puppet] - https://gerrit.wikimedia.org/r/295498 (https://phabricator.wikimedia.org/T135347)
[08:30:12] (PS6) Yuvipanda: Change-Prop: Added rules for ORES cache updates [puppet] - https://gerrit.wikimedia.org/r/295576 (owner: Ppchelko)
[08:30:34] (CR) Yuvipanda: [C: 2 V: 2] Change-Prop: Added rules for ORES cache updates [puppet] - https://gerrit.wikimedia.org/r/295576 (owner: Ppchelko)
[08:31:05] mobrovac done
[08:31:13] thnx yuvipanda!
[08:33:48] yuvipanda: did you merge it on the puppetmaster?
[08:34:24] mobrovac yup
[08:34:32] mobrovac failed on strontium, fixing
[08:34:45] mobrovac done
[08:34:54] cheers!
[08:39:09] !log change-prop restarting on scb to pick up ores rules https://gerrit.wikimedia.org/r/295576
[08:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:39:18] Amir1, Pchelolo, akosiaris: ^^
[08:39:45] (PS2) Filippo Giunchedi: swift: align partition to 1M boundary [puppet] - https://gerrit.wikimedia.org/r/295492
[08:39:53] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: align partition to 1M boundary [puppet] - https://gerrit.wikimedia.org/r/295492 (owner: Filippo Giunchedi)
[08:40:16] (Abandoned) Legoktm: Apache redirects for w.wiki [puppet] - https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) (owner: Dereckson)
[08:40:17] nice, thank
[08:40:42] *thanks
[08:55:19] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:58:05] _joe_: Small: https://nn.wikipedia.org/wiki/Spesial:AboutTopic/Q1955993 Medium: https://nn.wikipedia.org/wiki/Spesial:AboutTopic/Q105598 (Very) large: https://nn.wikipedia.org/wiki/Spesial:AboutTopic/Q2150573
[08:58:54] Given the item sizes on Wikidata, I guess the average will be between small and medium
[08:59:10] PROBLEM - HHVM rendering on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:01:29] RECOVERY - HHVM rendering on mw1160 is OK: HTTP OK: HTTP/1.1 200 OK - 66150 bytes in 0.250 second response time
[09:02:01] (PS1) Legoktm: Have "https://w.wiki/" do a 301 to Meta-Wiki [puppet] - https://gerrit.wikimedia.org/r/295632
[09:02:43] (PS2) Legoktm: Have "https://w.wiki/" do a 301 to Meta-Wiki [puppet] - https://gerrit.wikimedia.org/r/295632 (https://phabricator.wikimedia.org/T133485)
[09:04:10] RECOVERY - puppet last run on ms-be2024 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:07:29] RECOVERY - swift-object-server on ms-be2022 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python
/usr/bin/swift-object-server
[09:08:08] RECOVERY - swift-container-server on ms-be2022 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:08:09] RECOVERY - swift-account-server on ms-be2022 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:08:29] RECOVERY - swift-container-updater on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[09:09:19] RECOVERY - swift-object-updater on ms-be2022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[09:11:09] !log syncing etherpadlite.store (m1) on db2010, which had 2 bad chunks
[09:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:16:28] Operations, Wikimedia-SVG-rendering: PNG thumbnail preview of SVG misses some text - https://phabricator.wikimedia.org/T123106#2401262 (MoritzMuehlenhoff)
[09:18:12] Operations, Wikimedia-SVG-rendering: PNG thumbnail preview of SVG misses some text - https://phabricator.wikimedia.org/T123106#1921435 (MoritzMuehlenhoff) @Efa Thanks for the detailed bug report. This is a bug in the librsvg library we use to generate the PNG thumbnails. I have reproduced that this still...
[09:23:27] (PS3) Ema: Have "https://w.wiki/" do a 301 to Meta-Wiki [puppet] - https://gerrit.wikimedia.org/r/295632 (https://phabricator.wikimedia.org/T133485) (owner: Legoktm)
[09:24:27] (CR) Ema: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/295632 (https://phabricator.wikimedia.org/T133485) (owner: Legoktm)
[09:35:23] (CR) Dereckson: "Superseded by e031db9e, which handles URL shortener requests at Varnish level."
[puppet] - https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) (owner: Dereckson)
[09:43:02] (PS1) Gehel: Remove old maps-test servers from LVS config [puppet] - https://gerrit.wikimedia.org/r/295640
[09:44:43] (PS1) Hashar: contint: tidy Nodepool slaves config history [puppet] - https://gerrit.wikimedia.org/r/295641 (https://phabricator.wikimedia.org/T126552)
[09:45:22] (CR) Daniel Kinzler: [C: 1] Log PHP/HHVM errors in CLI mode to stderr, not stdout [mediawiki-config] - https://gerrit.wikimedia.org/r/295554 (https://phabricator.wikimedia.org/T138291) (owner: Hoo man)
[09:45:31] (CR) Gehel: "If I understand correctly, once this is merged, there is nothing more to do (no restart of LVS / pybal / ...)." [puppet] - https://gerrit.wikimedia.org/r/295640 (owner: Gehel)
[09:47:09] (CR) Hashar: "I have no idea how to properly test the puppet tidy type. Though based on puppet 3.4.3 source code that looks fine." [puppet] - https://gerrit.wikimedia.org/r/295641 (https://phabricator.wikimedia.org/T126552) (owner: Hashar)
[09:48:34] Operations, media-storage: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918#2401342 (fgiunchedi) doable also post-puppet but before machines are in services (i.e. many files owned by swift) ``` swift-init all stop userdel swift groupdel swift group...
[09:49:16] Operations, MediaWiki-extensions-UniversalLanguageSelector, Wikimedia-SVG-rendering, I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2401356 (Dereckson) Thanks for the update. If the author has some questions about licensing, I'll be happy to answer them.
[09:49:23] !log reimage ms-be202[567] with incorrect raid settings
[09:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:50:02] (CR) DCausse: [C: 1] Moving elasticsearch masters to new servers [puppet] - https://gerrit.wikimedia.org/r/295585 (https://phabricator.wikimedia.org/T138329) (owner: Gehel)
[09:50:05] Operations, Wikimedia-SVG-rendering, Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987#2401359 (MoritzMuehlenhoff)
[09:52:12] (PS4) Legoktm: Have "https://w.wiki/" do a 301 to Meta-Wiki [puppet] - https://gerrit.wikimedia.org/r/295632 (https://phabricator.wikimedia.org/T133485)
[09:54:17] Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1766457 (Andrew) I just spoke to Kelson about this, and I'm willing to set this up if we can provide him with the hardware. Adding a second bare-metal server will be an 'interestin...
[09:55:25] (PS5) Legoktm: Have "https://w.wiki/" do a 301 to Meta-Wiki [puppet] - https://gerrit.wikimedia.org/r/295632 (https://phabricator.wikimedia.org/T133485)
[09:58:06] (CR) Giuseppe Lavagetto: [C: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/295632 (https://phabricator.wikimedia.org/T133485) (owner: Legoktm)
[10:02:08] (CR) Alexandros Kosiaris: [C: 1] "+1. Looks fine. I'd be depooling first the servers on palladium using confctl but it is not strictly required."
[puppet] - https://gerrit.wikimedia.org/r/295640 (owner: Gehel)
[10:04:53] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: Connection timed out
[10:04:53] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: Connection timed out
[10:06:42] PROBLEM - Disk space on mw1301 is CRITICAL: Timeout while attempting connection
[10:06:43] PROBLEM - Disk space on mw1300 is CRITICAL: Timeout while attempting connection
[10:07:00] this is me
[10:07:32] (new jobrunners)
[10:07:49] should be silenced now
[10:08:25] Operations, Wikimedia-SVG-rendering, Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987#657868 (MoritzMuehlenhoff) That bug is fixed on the new jessie image scaler using 2.4.16 (tested locally, it's not yet pooled into the set o...
[10:08:33] Operations, Wikimedia-SVG-rendering, Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987#2401422 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[10:09:23] (PS1) Gehel: Decommission old maps servers [puppet] - https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T138329)
[10:09:50] (CR) Gehel: [C: -1] "Not to merge before traffic is moved off those servers" [puppet] - https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T138329) (owner: Gehel)
[10:13:36] !log restarting rabbitmq-server on labcontrol1001 (random debugging attempt for T138106)
[10:13:40] T138106: Nodepool has trouble taking snapshots on OpenStack labs - https://phabricator.wikimedia.org/T138106
[10:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:17:27] (CR) Gehel: "maps-test* servers have already been depooled:" [puppet] - https://gerrit.wikimedia.org/r/295640 (owner: Gehel)
[10:23:03] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 4 failures
[10:33:28] jynus: hey, if you are
around I have a DBA performance question
[10:33:39] tell me when you have some minutes
[10:35:15] ask- but I always prefer that you create a ticket
[10:35:53] PROBLEM - swift-account-auditor on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[10:36:23] PROBLEM - swift-account-reaper on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[10:36:43] PROBLEM - swift-account-replicator on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[10:39:00] jynus: https://phabricator.wikimedia.org/T138444
[10:39:03] PROBLEM - swift-object-replicator on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[10:39:04] made it
[10:39:10] https://gerrit.wikimedia.org/r/#/c/295528/3/includes/Hooks.php
[10:39:22] PROBLEM - swift-object-server on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[10:39:38] we want to know if joining with revision table can make this query faster
[10:41:13] We're already using that "hack" in a few places in MediaWiki itself, because there's no index on rc_this_id, but one on rc_timestamp
[10:41:23] yup
[10:42:02] Operations, Wikimedia-SVG-rendering, Upstream: Incorrect text positioning in SVG rasterization (any extreme down scale) (fixed in upstream 2.40.13) - https://phabricator.wikimedia.org/T65703#2401512 (MoritzMuehlenhoff)
[10:42:26] (PS1) Elukey: Add the -T VSL API timeout parameter plus the related formatter.
[software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/295652
[10:43:23] RECOVERY - swift-account-reaper on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[10:43:32] jynus: ^
[10:43:43] RECOVERY - swift-object-replicator on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[10:43:43] RECOVERY - swift-account-replicator on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[10:44:02] RECOVERY - swift-object-server on ms-be2025 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[10:45:13] RECOVERY - swift-account-auditor on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[10:46:22] (PS1) Jcrespo: [WIP] Move all misc db scripts to db_maintenance module [puppet] - https://gerrit.wikimedia.org/r/295654
[10:46:42] RECOVERY - Disk space on mw1301 is OK: DISK OK
[10:48:24] joins are not a problem in general
[10:48:49] joining with revision is, given that it is the largest table of all our infrastructure
[10:49:09] and recentchanges was created to avoid using it
[10:50:11] (CR) Nikerabbit: [C: -1] Deploy Compact Language Links as default (Stage 2) (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/295454 (https://phabricator.wikimedia.org/T136677) (owner: KartikMistry)
[10:50:23] RECOVERY - Disk space on mw1300 is OK: DISK OK
[10:51:04] this is not a matter of opinion, please generate a query from labs -or anywhere you have a test env-, paste them and I can check them on several wikis
[10:51:09] Amir1^
[10:51:33] we did that before, let me find and paste it here
[10:51:53] do not paste it here
[10:51:57] paste it on the task
[10:52:18] try to centralize things there- this is ok for a heads up, but the rest is better there
[10:54:12] jynus:
https://phabricator.wikimedia.org/T138444#2401522
[10:54:21] did that already
[10:55:33] Test it in wikidata
[10:55:36] jynus: ^
[10:56:47] (CR) jenkins-bot: [V: -1] [WIP] Move all misc db scripts to db_maintenance module [puppet] - https://gerrit.wikimedia.org/r/295654 (owner: Jcrespo)
[10:58:24] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.429 second response time
[10:59:16] I need to go for lunch, I'll be back soon
[11:00:03] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.015 second response time
[11:03:23] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: puppet fail
[11:11:31] Some issues on etherpad atm it seems (I assume because of hackathon attention?) https://usercontent.irccloud-cdn.com/file/yD9IQ6Jo/etherpaderror
[11:15:41] (PS34) Alexandros Kosiaris: network: add $production_networks [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[11:15:49] (PS3) KartikMistry: Deploy Compact Language Links as default (Stage 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/295454 (https://phabricator.wikimedia.org/T136677)
[11:17:10] Operations, Patch-For-Review: Contain imagemagick on the image scalers with firejail - https://phabricator.wikimedia.org/T135111#2401558 (MoritzMuehlenhoff) Open>Resolved This is enabled on the image scalers (and app servers for the Score extensions) since last week
[11:17:42] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:19:56] (PS1) Gehel: Add new elasticsearch servers to LVS [puppet] - https://gerrit.wikimedia.org/r/295657 (https://phabricator.wikimedia.org/T138329)
[11:30:33] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:32:39] (PS1) Muehlenhoff: Add firejail wrapper for
rsvg-convert [puppet] - https://gerrit.wikimedia.org/r/295659
[11:32:58] !log rolling restart of elasticsearch10(01|30|08|36|13|40) to activate new masters
[11:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:33:05] (PS3) Gehel: Moving elasticsearch masters to new servers [puppet] - https://gerrit.wikimedia.org/r/295585 (https://phabricator.wikimedia.org/T138329)
[11:35:18] (CR) Gehel: [C: 2] Moving elasticsearch masters to new servers [puppet] - https://gerrit.wikimedia.org/r/295585 (https://phabricator.wikimedia.org/T138329) (owner: Gehel)
[11:37:32] back
[11:37:55] jynus: it would be great if you check it
[11:38:16] Would anyone be able to point me to the puppet code that describes our storage setup for graphite? maybe _joe_ ? Reading teh graphite module looks like it does not have storage
[11:38:19] *the
[11:38:28] (PS1) Jcrespo: Depool db1059; Repool db1061 & db1062; increase weight of db1068 [mediawiki-config] - https://gerrit.wikimedia.org/r/295661
[11:38:31] Amir1, I will
[11:38:38] are you in a rush?
[11:39:05] (PS4) KartikMistry: Deploy Compact Language Links as default (Stage 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/295454 (https://phabricator.wikimedia.org/T136677)
[11:39:14] jynus: we have a showcase in one hour, I thought it would be great if we can show it to people
[11:39:47] you want to deploy to production?
[11:44:13] nuria_, I think ~/puppet/modules/graphite/manifests/init.pp has everthing it needs on storage side [11:44:29] jynus: thank you, looking [11:45:17] there is of course the cluster on top of that [11:46:04] plus if you are interested on a future setup, we have as a more promising solution (for us) prometheus [11:47:05] jynus: not deploying, just telling people that we merged it and it'll be there soon [11:47:36] well, then no rush- if someone complains, tell them it will be [11:47:55] okay :) [11:48:02] or if someone complains tell them is all my fault [11:49:49] (03CR) 10Jcrespo: [C: 032] Depool db1059; Repool db1061 & db1062; increase weight of db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295661 (owner: 10Jcrespo) [11:51:42] PROBLEM - puppet last run on sca2002 is CRITICAL: CRITICAL: puppet fail [11:52:29] (03PS2) 10Muehlenhoff: Add firejail wrapper for rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/295659 [11:54:30] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2401650 (10Dzahn) No, i don't think it is done. Still what Rob described above. [11:58:30] (03CR) 10Legoktm: [C: 031] Log PHP/HHVM errors in CLI mode to stderr, not stdout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295554 (https://phabricator.wikimedia.org/T138291) (owner: 10Hoo man) [12:00:22] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add firejail wrapper for rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/295659 (owner: 10Muehlenhoff) [12:01:32] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2401684 (10jcrespo) @Dzahn check your mail. 
[12:07:05] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1059; Repool db1061 & db1062; increase weight of db1068 (duration: 00m 39s) [12:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:12:02] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2401704 (10Dzahn) Yay! Thank you! [12:14:58] <_joe_> moritzm: wow nice! [12:15:05] <_joe_> (firejail for rsvg-convert) [12:17:33] RECOVERY - puppet last run on sca2002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:20:07] (03CR) 10Alexandros Kosiaris: "PCC is happy at https://puppet-compiler.wmflabs.org/3168/carbon.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [12:31:32] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:31:51] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api [12:32:13] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:32:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [12:32:31] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:33:19] (03PS2) 10Elukey: Add the -T VSL API timeout parameter plus the related formatter. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/295652 [12:33:52] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:34:01] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [12:34:41] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:34:51] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [12:34:52] _joe_: not yet enabled so far. will first pool mw1291 for some 15 mins of smoketesting in a bit [12:39:37] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2401809 (10jcrespo) >> It blows up and rebuilds all wikis on every run. >It truncates the meta_p.wiki table but it doesn't drop... [12:40:11] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:41:02] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:41:21] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api [12:41:31] same issue as yesterday? [12:41:51] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:42:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [12:43:26] again this shit? 
[12:43:36] akosiaris: weren't you fixing that/ [12:43:45] I am just asking, how can I check [12:43:54] no, he wasn't the one [12:44:19] joe implemented it, he said, but it is not his fault [12:44:48] <_joe_> I implemented the checker, not the specs [12:44:52] let's calm down :-) [12:44:54] <_joe_> that's what should be removed [12:45:06] yes, I understood it like that [12:45:13] paravoid: fixing ? [12:45:14] how ? [12:45:16] <_joe_> (thst is monitoring basically an external resource) [12:45:25] yes [12:45:38] yeah the gov database about PMCID [12:45:49] <_joe_> who is responsible for citoid? [12:45:54] _joe_: good thing you haven't set those to paging [12:45:58] <_joe_> I CAN PESTER PEOPLE IRL FOR ONCE [12:46:00] _joe_: mobrovac, how else ? [12:46:04] who* [12:46:27] <_joe_> actually, people are pestering me more than I like :P [12:46:28] somehow mobrovac is responsible for 70% of the services [12:46:31] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [12:46:31] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:46:41] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:47:02] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [12:47:03] 06Operations, 10ops-eqiad, 06DC-Ops: mw1302.eqiad.wmnet issues while booting - https://phabricator.wikimedia.org/T138485#2401862 (10elukey) [12:47:21] _joe_: anyway, the only way to actually fix that is have the spec inform the checker that this endpoint either should not be monitored or is ok to return an error [12:47:44] effectively both mean "not monitored" [12:47:46] <_joe_> akosiaris: yes, I am supposed to spin off service_checker from the puppet repo today [12:47:52] akosiaris: i can just remove these checks from the spec for the time being, i guess [12:47:59] <_joe_> mobrovac: yes [12:48:03] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:48:04] <_joe_> mobrovac: 
WHERE ARE U [12:48:12] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [12:48:14] that would be a solution. I 'd appreciate it :-) [12:48:20] <_joe_> I want to storm to a dev's desk shouting "you fix this shit" [12:48:22] <_joe_> :D [12:48:28] _joe_: IN ROOM 30, but i'm gonna get out soon [12:48:42] _joe_: you have a few secs.. RRRRRRRRRRRRRRUUUUUUUUUUUUN!!!!!! [12:48:42] <_joe_> mobrovac: you can't hide for long!! [12:48:51] lol [12:48:55] <_joe_> akosiaris: no I am busy making fun of yurik [12:48:59] <_joe_> err yuvipanda [12:49:04] <_joe_> sorry yurik [12:49:05] <_joe_> :) [12:49:35] !log pooling new jessie image scaler mw1291 for short production smoke testing [12:49:38] <_joe_> mobrovac: WHERE ARE U .> [12:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:49:56] vb.net ? [12:50:03] p858snake|_: that'll never get old :) [12:50:08] lol [12:50:22] akosiaris: https://www.youtube.com/watch?v=hkDD03yeLnU [12:50:26] <_joe_> p858snake|_: we're in the same physical place, at wikimania [12:51:13] <_joe_> mobrovac: you will appreciate https://www.youtube.com/watch?v=s5ocXFgowZA [12:51:14] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, +1 on merging to move forward" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [12:51:14] mobrovac: thanks for the link I somehow missed this pearl before today [12:51:18] <_joe_> (it's italian, sorry) [12:51:37] "un debian!" [12:51:47] <_joe_> elukey: never gets old, right? [12:51:50] nope [12:54:06] (03CR) 10DCausse: [C: 031] Add new elasticsearch servers to LVS [puppet] - 10https://gerrit.wikimedia.org/r/295657 (https://phabricator.wikimedia.org/T138329) (owner: 10Gehel) [12:54:15] _joe_, what does the expert say "debian, similar to linux code?" something like that? 
[12:55:13] <_joe_> jynus: he says, let me try to translate [12:55:23] (03CR) 10DCausse: Decommission old maps servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T138329) (owner: 10Gehel) [12:55:27] <_joe_> "holy shit! it's an debian! similar to linux" [12:56:01] /r/itsaunixsystem [12:56:27] I think the worst offender is this one: https://www.youtube.com/watch?v=u8qgehH3kEQ [12:56:56] "molto simile a linux" [12:56:57] hahaha [12:56:59] heheh NCIS delivers [12:57:16] (03CR) 10Gehel: Decommission old maps servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T138329) (owner: 10Gehel) [12:58:38] (03PS2) 10Gehel: Decommission old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T138329) [12:58:38] 06Operations, 10Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2401926 (10BBlack) [[ https://tools.ietf.org/html/rfc7905 | RFC 7905 ]] is published! Now we just need a released version of openssl 1.1.x :) We could test a build of openssl's master branch on cp1... [12:58:52] (03CR) 10Gehel: Decommission old elasticsearch servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T138329) (owner: 10Gehel) [13:00:25] * yurik throws a banana at _joe_ [13:03:17] (03PS1) 10Ppchelko: Change-Prop: Ignore certain errors on page_delete and null_edit. 
[puppet] - 10https://gerrit.wikimedia.org/r/295680 [13:09:05] !log depooled jessie image scaler (mw1291) again, works fine, to be permanently pooled on Monday [13:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:11:25] !log purged some puppet output logs on compiler02.puppet3-diffs.eqiad.wmflabs to free space (disk full) [13:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:11:47] I synced a change but it did not get to some mediawikis, and now they are querying wrong db hosts [13:11:48] !log citoid deploying 0129ab0b [13:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:24] akosiaris: ^^ done, now the spec checks only citoid and zotero [13:13:06] PROBLEM - mediawiki-installation DSH group on mw1300 is CRITICAL: Host mw1300 is not in mediawiki-installation dsh group [13:13:06] PROBLEM - mediawiki-installation DSH group on mw1301 is CRITICAL: Host mw1301 is not in mediawiki-installation dsh group [13:13:11] !log running scap pool on mw1300 [13:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:17] ah! [13:13:20] there it is [13:13:43] those hosts are pooled but not being updated [13:13:50] !log restarting zotero on sca, 6g mem [13:13:51] which is really dangerous [13:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:57] mobrovac: thanks! [13:14:19] jynus: those are new jobrunners :) [13:14:45] elukey, it is ok if they are not updated, but please repool it [13:14:52] production queries are running on them [13:15:05] how is that possible? 
[13:15:25] I thought I needed to add them in puppet [13:15:43] allow me to update them so they do not fail, you can continue investigating [13:15:57] oh sure go ahead, sorry for the trouble [13:16:04] I thought that I needed to explicitly pool them first [13:16:07] 06Operations, 10Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2401975 (10MoritzMuehlenhoff) 1.1.0~pre5 is in Debian experimental. It has quite some API changes, though. https://wiki.openssl.org/index.php/1.1_API_Changes In a rebuild of the Debian archive over... [13:16:09] (03CR) 10Alexandros Kosiaris: [C: 032] "Merging then" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [13:16:16] (03PS35) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [13:16:21] (03CR) 10Alexandros Kosiaris: [V: 032] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [13:16:25] elukey, see that I am not lying: https://phabricator.wikimedia.org/P3304 [13:16:56] !log running scap pool on mw1301 [13:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:26] jynus: didn't think that you were lying, I was just surprised :) [13:17:38] well, I could be wrong [13:17:56] that is why I am sharing the same thing I saw, so you have more info [13:18:02] thanks! 
[13:18:26] afaik the jobrunners need to be in hiera before starting to pull jobs from the queues [13:18:35] mmm [13:18:38] mmm, not sure about that [13:18:56] they were controled with salt in the past [13:19:13] but if this confirms, please report a bug [13:20:35] I have two more coming up to speed so I am going to check that now :) [13:21:56] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: puppet fail [13:21:56] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: puppet fail [13:22:15] maybe I confused them with the hiera config for the job queues itself [13:27:29] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2402012 (10fgiunchedi) [13:27:31] 06Operations, 10media-storage, 07Tracking: refresh swift hardware in codfw/eqiad (tracking) - https://phabricator.wikimedia.org/T130012#2402011 (10fgiunchedi) [13:29:05] 06Operations, 10Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2402015 (10BBlack) Yeah it's going to be a big transition. I've seen openssl-1.1.x-related patches in nginx master though (which is basically what we're running), so I'm crossing fingers that nginx... 
[13:29:29] (03PS5) 10KartikMistry: Deploy Compact Language Links as default (Stage 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295454 (https://phabricator.wikimedia.org/T136677) [13:30:16] !log db1059 backup and reimage [13:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:15] I kept only puppet compiler outputs up to 40 days ago on compiler02.puppet3-diffs.eqiad.wmflabs to free space [13:33:28] FYI to everybody [13:33:41] hope that I didn't cancel something important [13:34:49] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: Connection timed out [13:34:59] just silenced it [13:35:41] ah and also the jobrunners don't need to be in mediawiki-installation [13:36:22] !log CI is slowed down due to surge of jobs and lack of instances to build them on ( T133911 ). Queue is 50 for Jessie and 25 for Trusty. [13:36:23] T133911: Bump quota of Nodepool instances (contintcloud tenant) - https://phabricator.wikimedia.org/T133911 [13:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:56] no, only the service/cron/whatever has to be active [13:37:24] (03CR) 10Elukey: "The change looks really good, thanks again!" [puppet] - 10https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422) (owner: 10Nicko) [13:38:16] jynus: /me learning, thanks! 
[13:39:00] (03CR) 10Alexandros Kosiaris: [C: 032] ferm: Kill INTERNAL_V4/INTERNAL_V6 definitions [puppet] - 10https://gerrit.wikimedia.org/r/295332 (owner: 10Alexandros Kosiaris) [13:39:04] (03PS2) 10Alexandros Kosiaris: ferm: Kill INTERNAL_V4/INTERNAL_V6 definitions [puppet] - 10https://gerrit.wikimedia.org/r/295332 [13:39:15] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ferm: Kill INTERNAL_V4/INTERNAL_V6 definitions [puppet] - 10https://gerrit.wikimedia.org/r/295332 (owner: 10Alexandros Kosiaris) [13:39:17] but they have to be on the "updatable group" (dsh), which I cannot find now [13:40:34] jynus: mw1001.eqiad.wmnet is a jobrunner and it is in mediawiki-installation (DSH), and icinga is complaining about 130[01] not being in there [13:41:14] if mediawiki-installation is dsh, then yes, it must be there [13:41:28] super [13:41:33] sorry, I confuse that with the etcd pooling config [13:41:40] *got confused [13:41:41] going to reboot mw1304 and then I'll add the last 3 [13:41:59] I do not work with that very often, I am learning too [13:43:40] (03PS1) 10Elukey: Add mw130[01] to the mediawiki DSH scap list (new jobrunners) [puppet] - 10https://gerrit.wikimedia.org/r/295690 [13:43:45] to be fair, part of dsh config is on hiera and part is on modules/scap, not precisely strightforward [13:46:04] (03CR) 10Elukey: [C: 032 V: 032] Add mw130[01] to the mediawiki DSH scap list (new jobrunners) [puppet] - 10https://gerrit.wikimedia.org/r/295690 (owner: 10Elukey) [13:46:34] (03PS5) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 [13:49:25] 06Operations, 10Traffic, 13Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2402049 (10ema) The initial portion of the 3WHS can be used to check whether a remote TCP server supports TFO. For example, with [[https://github.com/secdev/scapy/ | scapy]]: ``` f... 
[13:50:39] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:50:40] RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:53:09] !log Zuul/CI are slowly catching up. I had to drop a few changes that got force merged on the SmashPig repo. [13:53:09] should be all fine [13:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:20] have to head to dentist be back later in the evening [13:58:06] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, and 2 others: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2402068 (10Krenair) a:05Jdforrester-WMF>03ori [14:05:12] _joe_: here's the ORES query that redirects to http: https://ores.wmflabs.org/v2/scores/enwiki/wp10/642215410?features [14:05:22] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2402093 (10Krenair) 05stalled>03Resolved a:03Krenair Yes, this appears to be complete now. [14:05:46] _joe_: the lack of a slash before the param is what triggers that behavior. [14:06:45] ragesock, I'm not getting the HTTP redirection. [14:07:20] Oh wait... it appears I am [14:07:21] halfak: do "curl "https://ores.wmflabs.org/v2/scores/enwiki/wp10/642215410?features" -v" [14:07:22] Woah [14:08:34] So, it looks like you make a request to https, get a 301 for http and then get a 301 for https. [14:09:09] Then the https 200 OK's [14:09:41] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Puppet has 1 failures [14:10:43] halfak: yeah. weird, huh? [14:10:57] So, I think I know what this is. [14:11:33] The web nodes know to redirect .../ to ...//, but they get the request forwarded via http [14:11:43] So they provide an http redirect. 
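Note on the redirect loop described above (https request, 301 to http, 301 back to https, then 200): a minimal Python sketch of the mechanism, not the actual ORES code. It assumes the backend builds its trailing-slash redirect from the scheme it sees on the forwarded request, and that the TLS-terminating proxy could pass the original scheme in an `X-Forwarded-Proto` header. Whether the ORES proxy sets that header is not stated in the log, and `trailing_slash_redirect` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def trailing_slash_redirect(request_url, forwarded_proto=None):
    """Build the Location value for a redirect that adds a trailing slash.

    request_url: the URL as the backend sees it (plain http behind the proxy).
    forwarded_proto: value of an X-Forwarded-Proto header, if the proxy sets
    one (an assumption; the log does not say whether it does).
    """
    parts = urlsplit(request_url)
    # Without the forwarded scheme the backend only knows "http",
    # which produces the extra http hop seen in the curl output.
    scheme = forwarded_proto or parts.scheme
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((scheme, parts.netloc, path, parts.query, parts.fragment))


# The request as the backend receives it after TLS termination.
backend_view = "http://ores.wmflabs.org/v2/scores/enwiki/wp10/642215410?features"

naive = trailing_slash_redirect(backend_view)
aware = trailing_slash_redirect(backend_view, forwarded_proto="https")
print(naive)
print(aware)
```

With the forwarded scheme taken into account, the client would be sent straight to the https URL with the slash added, so the intermediate http hop would disappear. Requesting the URL with the slash already in place should avoid the redirect entirely.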
[14:12:36] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2402101 (10fgiunchedi) thanks @Cmjohnson ! I was checking again the allocation and there's a correction: row A isn't needed. Please go with 2x machines in each of B/C/D. wrt 10G vs 1G let'... [14:12:41] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [14:13:53] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2402105 (10Cmjohnson) @fgiunchedi That will be 2 each in rows A/C/D for 10G. [14:14:22] RECOVERY - puppet last run on ms-be2027 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:14:23] RECOVERY - mediawiki-installation DSH group on mw1301 is OK: OK [14:14:23] RECOVERY - mediawiki-installation DSH group on mw1300 is OK: OK [14:17:55] (03CR) 10Alexandros Kosiaris: [C: 031] "PCC is happy at https://puppet-compiler.wmflabs.org/3174/carbon.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris) [14:18:27] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2402115 (10elukey) Finally got the root cause of the VSL timeouts after a chat with Varnish devs. The Varnish workers use a buffer to... [14:18:40] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2402117 (10fgiunchedi) @Cmjohnson ok! let's stick with B/C/D for rows and 10G for C/D and 1G for B [14:26:37] 06Operations, 10media-storage, 07Tracking: expand swift hardware in codfw/eqiad (tracking) - https://phabricator.wikimedia.org/T130012#2402130 (10fgiunchedi) [14:27:07] etherpad.wm.org seems down ? 
[14:27:12] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:28:06] it seems sometimes up, sometimes down [14:28:28] I was doing some maintenance on its passive slave [14:28:34] (03PS1) 10KartikMistry: apertium-eu-en: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-eu-en] - 10https://gerrit.wikimedia.org/r/295696 (https://phabricator.wikimedia.org/T107306) [14:29:06] I have now stopped it, but the error keeps happening [14:29:17] I will restart the service [14:29:19] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2402140 (10KartikMistry) [14:29:26] well, check the errors first [14:29:51] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:30:17] "console - TypeError: Cannot set property 'timestamp' of null" [14:30:53] it is flopping [14:31:18] I am going to restart it, it is better than the current state [14:32:06] !log restarting etherpad-lite.service [14:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:35] is it better now? [14:33:11] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [14:34:12] akosiaris, see log [14:34:17] looking [14:34:19] it is flopping up and down [14:34:25] yeah, it's crashing [14:34:26] I tried restarting already [14:34:39] is it usually a single pad or is this new? [14:34:59] I think it's a single pad [14:35:03] lemme delete and see [14:35:30] something seems sliglty better [14:35:47] maybe it is a couple? 
[14:35:58] no, seems like more than one pad [14:36:05] there is no pattern [14:36:10] ah [14:36:20] (03PS1) 10KartikMistry: apertium-eu-en: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-eu-es] - 10https://gerrit.wikimedia.org/r/295697 (https://phabricator.wikimedia.org/T107306) [14:36:22] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:37:10] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2402181 (10KartikMistry) [14:37:34] should I prepare the backup? [14:38:18] !log stopping etherpad-lite on etherpad1001, disabling puppet [14:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:32] jynus: hmmm.... [14:38:35] it could also be a connection overload due to wikimania [14:38:41] how much effort is that ? 
[14:38:46] I am preparing the backup just in case [14:39:00] effor not much, but it will take some time [14:39:12] I will be doing it anyway [14:39:18] jynus: https://grafana.wikimedia.org/dashboard/db/etherpad [14:39:19] on a separate db [14:39:24] users are not so many however [14:40:09] 06Operations, 10media-storage: bring swift eqiad to one zone per row - https://phabricator.wikimedia.org/T138496#2402183 (10fgiunchedi) [14:40:20] 06Operations, 10media-storage: bring swift eqiad to one zone per row - https://phabricator.wikimedia.org/T138496#2402198 (10fgiunchedi) p:05Triage>03Normal [14:41:27] let's do something [14:41:30] so, blocking all access and accessing a known pad does not crash it [14:41:32] lets recover the service [14:41:44] to test the theroy [14:41:52] by creating a blank DB [14:41:59] consider it tested [14:42:01] *to test my theory [14:42:15] I 've just blocked all access via ferm to etherpad [14:42:22] and access the SoS pad via SSH tunnel [14:42:27] it does not crash the service [14:42:34] ok, I will rename the table and create a new one [14:42:40] so, it's either something in the DB or something else [14:42:58] do you know if first install need some things already on the db? [14:43:27] no you don't [14:43:57] actually it will create everything on its own [14:44:07] ok, start the service now [14:44:15] it has a black "db" [14:44:18] *blank [14:44:32] seems like it's working [14:44:36] but of course no data [14:44:41] but that's expected [14:44:42] no prob [14:44:47] so, it's the DB that's problematic [14:44:48] I will recover now the data [14:45:10] i guess someone pasted something that it didnt like :) [14:45:14] can you add like a message? [14:45:36] modify maybe the default message? [14:45:37] !log debugging etherpad. 
Started the service with a blank db, looks like it's working [14:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:57] jynus: er, yes [14:46:03] because even if I recover, the one in the current one will be lost [14:46:10] I can recover, but probably not merge [14:46:12] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:47:11] !log change-prop deploying 05c72ed24ca [14:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:42] !log change the default message in etherpad to indicate problems [14:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:17] indeed I see quite a few people from wikimania [14:48:19] at least I think [14:48:22] sorry I am such an ass [14:48:25] can you add [14:48:37] "backup anything you add here as it will be deleted" [14:48:43] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:49:13] sorry for that O:-( [14:49:36] or if you just tell me where that is, I can do it :-) [14:49:37] done [14:50:08] thank you, will take it from here [14:51:11] seems like people are temporarily backing off already [14:51:20] :-/ [14:51:56] to be fair, it is not like we have a proper HA setup, or that that is needed [14:52:18] if it's db corruption indeed, that would not have helped much [14:52:26] true [14:52:45] also, we have preciselly old m1 down [14:52:48] *have [14:54:21] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2402245 (10AlexMonk-WMF) >>! In T138450#2401809, @jcrespo wrote: >>> It blows up and rebuilds all wikis on every run. >>It trunc... [14:54:27] jynus: have an ETA by any chance ? [14:54:34] or is there something I should be doing ? 
[14:54:40] * akosiaris feels itchy [14:55:03] * _joe_ scratches akosiaris [14:55:32] so, people have actually stopped trying to access etherpad [14:55:50] a few here and there but not the usual rate obviously [14:55:52] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2402249 (10jcrespo) > Who in the ops group could be its 'sole owner'? No one else has any access to these systems. Maybe labs a... [15:00:04] anomie, ostriches, thcipriani, marktraceur, and aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160623T1500). Please do the needful. [15:00:04] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:01:28] I can SWAT, kart_ ping me when you're around for SWAT [15:01:46] thcipriani: around. [15:01:48] akosiaris, I said it was not going to be fast :-(, I am on it [15:02:03] thcipriani: usual deploy on test host first as dblist is new this time. [15:02:05] jynus: yeah understood [15:02:08] kart_: ack [15:02:22] jynus: are you restoring a different table/db ? [15:02:47] I am not restoring yet, I am still searthing for the table [15:02:51] but I will [15:03:09] I am thinking I should rename the table back and try to make some more sense from the issue [15:03:19] should I ? [15:03:24] no [15:03:24] (03PS6) 10Thcipriani: Deploy Compact Language Links as default (Stage 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295454 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:03:26] or will this cause problems for you ? 
[15:03:34] do that on a separate instance [15:04:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295454 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:04:14] https://github.com/ether/etherpad-lite/issues/2946 [15:04:55] (03Merged) 10jenkins-bot: Deploy Compact Language Links as default (Stage 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295454 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:05:44] !log starting data backup of labmon1001, halting statsite/graphite/carbon-relay on system [15:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:20] kart_: patch has been pulled to mw1017 [15:06:42] thcipriani: testing.. [15:09:56] (03PS1) 10Elukey: Restore mc1007 memcached growth factor to 1.05 as the rest of the cluster. [puppet] - 10https://gerrit.wikimedia.org/r/295702 (https://phabricator.wikimedia.org/T129963) [15:10:40] 06Operations, 06Discovery, 06Maps: Ensure Maps servers can be installed easily (automation + documentation) - https://phabricator.wikimedia.org/T138501#2402296 (10Gehel) [15:11:00] !log puppet disabled on labmon1001 along with all icinga alerting. data migration to usb in progress via root screen session [15:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:15] thcipriani: still some moar testing, 2-3 minutes please. [15:11:26] kart_: kk, np [15:11:40] robh: labmon goes well? [15:12:38] thcipriani: looks good, go ahead. 
[15:12:42] chasemp: so far so good, data copy in progress with --ignore-existing to try to cut down on cruft [15:12:49] kart_: ack [15:13:18] and the various services (statsite/graphite/carbon-relay) are stopped [15:13:35] but only at 3% of copy so it may still take a long time =[ [15:15:29] !log thcipriani@tin Synchronized dblists/clldefault.dblist: SWAT: [[gerrit:295454|Deploy Compact Language Links as default (Stage 2)]] PART I (duration: 00m 41s) [15:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:20] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:295454|Deploy Compact Language Links as default (Stage 2)]] PART II (duration: 00m 28s) [15:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:43] (03PS1) 10Yurik: Prevent geoshape service use by production [puppet] - 10https://gerrit.wikimedia.org/r/295703 [15:16:48] gehel, ^ [15:16:51] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:295454|Deploy Compact Language Links as default (Stage 2)]] PART III (duration: 00m 24s) [15:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:58] ^ kart_ check please [15:18:07] thcipriani: testing. [15:18:41] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint: Ensure Maps servers can be installed easily (automation + documentation) - https://phabricator.wikimedia.org/T138501#2402333 (10Yurik) [15:18:52] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: puppet fail [15:20:48] (03CR) 10Thcipriani: [C: 031] Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev) [15:20:56] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2402334 (10Yurik) [15:22:23] thcipriani: nice. All well. 
[15:22:38] kart_: glad to hear it :) [15:22:40] and thanks! [15:23:04] (03PS2) 10Yurik: Prevent geoshape service use by production [puppet] - 10https://gerrit.wikimedia.org/r/295703 [15:23:32] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:19] (03CR) 10Elukey: "Puppet compiler looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/295702 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [15:26:10] !log stop etherpad-lite, etherpad is down [15:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:58] thcipriani, swating? [15:27:10] yurik: finished [15:27:26] thcipriani, bummer, i forgot to add a small labs-only patch [15:27:37] if you don't mind, i will sync it now [15:27:40] yurik: oh, if you need a patch merged, there's still time [15:27:46] (03PS1) 10Elukey: Add mw1303 to the scap MW DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295706 [15:27:50] or you can do it :) [15:27:56] which patch? [15:28:13] https://gerrit.wikimedia.org/r/#/c/295580/1/wmf-config/CommonSettings-labs.php [15:28:15] thcipriani, ^ [15:28:36] (03CR) 10Elukey: [C: 032 V: 032] Add mw1303 to the scap MW DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295706 (owner: 10Elukey) [15:28:46] (03PS2) 10Thcipriani: LABS: Enable geoshapes graph protocol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295580 (https://phabricator.wikimedia.org/T138192) (owner: 10Yurik) [15:28:53] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295580 (https://phabricator.wikimedia.org/T138192) (owner: 10Yurik) [15:29:10] thx! 
[15:29:18] :D [15:29:34] (03Merged) 10jenkins-bot: LABS: Enable geoshapes graph protocol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295580 (https://phabricator.wikimedia.org/T138192) (owner: 10Yurik) [15:30:58] !log thcipriani@tin Synchronized wmf-config/CommonSettings-labs.php: SWAT: [[gerrit:295580|LABS: Enable geoshapes graph protocol]] (duration: 00m 29s) [15:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:33] akosiaris: What's up with EPL? [15:32:59] marktraceur: EPL ? [15:33:13] akosiaris: Etherpad Lite [15:33:19] a etherpad lite ? it's not a great piece of software, what else ? [15:33:34] akosiaris: Just wondering, I saw you logged it went down [15:33:47] yeah, it crashes constantly [15:33:56] marktraceur: https://github.com/ether/etherpad-lite/issues/2946 [15:33:59] is the upstream issue [15:34:07] still trying to figure out what is going on [15:34:18] marktraceur: see also https://lists.wikimedia.org/pipermail/wikimania-l/2016-June/007570.html [15:34:19] Ah. [15:34:19] nothing conclusive yet, aside from what you see in that ticket [15:34:39] K, yeah, I was worried it would affect the hackathon at WM [15:37:06] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2402418 (10elukey) I have been very slow to follow up on this task due to other priorities, I'll add a summary very soon for all my findings. gerrit/295702 is a... [15:41:18] 06Operations, 10ops-eqiad, 06DC-Ops: mw1302.eqiad.wmnet issues while booting - https://phabricator.wikimedia.org/T138485#2402421 (10elukey) Also installing mw1304 leads to: Loading Linux 4.4.0-1-amd64 ... Loading initial ramdisk ... Tried to hard reboot, nothing. Not sure where it gets stuck into.. 
[15:41:42] (03PS1) 10RobH: setting up temp spare host for labmon1001 data migrations [dns] - 10https://gerrit.wikimedia.org/r/295711 [15:41:53] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 1 failures [15:44:53] (03PS1) 10RobH: setting WMF4724 install params [puppet] - 10https://gerrit.wikimedia.org/r/295718 [15:45:20] (03CR) 10RobH: [C: 032] setting up temp spare host for labmon1001 data migrations [dns] - 10https://gerrit.wikimedia.org/r/295711 (owner: 10RobH) [15:49:14] (03CR) 10RobH: [C: 032 V: 032] setting WMF4724 install params [puppet] - 10https://gerrit.wikimedia.org/r/295718 (owner: 10RobH) [15:50:42] PROBLEM - puppet last run on etherpad1001 is CRITICAL: Timeout while attempting connection [15:51:55] (03CR) 10Gehel: Prevent geoshape service use by production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295703 (owner: 10Yurik) [15:52:48] (03CR) 10Gehel: Prevent geoshape service use by production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295703 (owner: 10Yurik) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160623T1600). Please do the needful. 
[16:00:52] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: Connection refused [16:01:37] no puppet swat patches afaics [16:01:52] PROBLEM - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [16:03:42] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [16:06:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5153362 keys - replication_delay is 0 [16:07:22] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:07:54] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2402468 (10chasemp) I'm 100% on board for being on the hook for this process, or at least being a partner. We can coparent :)... 
[16:09:49] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2402474 (10chasemp) p:05Triage>03High [16:10:33] (03PS1) 10BBlack: r::c::perf: move all commentary inline [puppet] - 10https://gerrit.wikimedia.org/r/295722 [16:10:35] (03PS1) 10BBlack: r::c::perf: enable tcp metrics saving [puppet] - 10https://gerrit.wikimedia.org/r/295723 [16:10:38] (03PS1) 10BBlack: cache roles: add tcpmhash_entries=64K to kernel cmdline [puppet] - 10https://gerrit.wikimedia.org/r/295724 [16:11:12] RECOVERY - etherpad_lite_process_running on etherpad1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [16:12:05] (03CR) 10jenkins-bot: [V: 04-1] r::c::perf: move all commentary inline [puppet] - 10https://gerrit.wikimedia.org/r/295722 (owner: 10BBlack) [16:12:13] (03CR) 10jenkins-bot: [V: 04-1] r::c::perf: enable tcp metrics saving [puppet] - 10https://gerrit.wikimedia.org/r/295723 (owner: 10BBlack) [16:12:31] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7928 bytes in 0.017 second response time [16:12:53] (03CR) 10jenkins-bot: [V: 04-1] cache roles: add tcpmhash_entries=64K to kernel cmdline [puppet] - 10https://gerrit.wikimedia.org/r/295724 (owner: 10BBlack) [16:14:00] 16:11:22 Looking potential typos from '/typos' file [16:14:00] 16:11:30 ./modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200: fixed-address WMF4724.eqiad.wmnet; [16:14:02] 16:11:30 Typos found! [16:14:13] ^ jenkins is -1 on new puppet commits for unrelated things again... 
[16:15:11] yeah, https://gerrit.wikimedia.org/r/#/c/295718/ [16:15:36] yeah, the typos file is wrong in this case [16:16:00] it doesn't like WMF4724.eqiad.wmnet because it expects anything[0-9]{4}.eqiad to start the number with 1 [16:16:28] robh [16:16:31] it was merged before jenkins could vote [16:16:43] either way, jenkins' vote is faulty [16:16:47] ahh [16:16:52] did i break something? [16:17:14] robh: yeah your V+2 overrode what would've been a jenkins -1, which now applies to all future commits until it's fixed :P [16:17:21] (03CR) 10Gehel: "Yep, that error looks weird, but also appears on the production catalogue, a clear indication that it is not related to this change. Still" [puppet] - 10https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422) (owner: 10Nicko) [16:17:30] bblack: wait, all future commits of other folks? [16:17:43] yes, because the merged state of the repo fails validation checks [16:17:49] fuck me sorry =[ [16:18:23] so the ideal way for me to fix is just make a single fix patch independently of the original? [16:18:26] but the fix really isn't in your commit, the "typos" validation check is in error (it's wrongly not liking your change) [16:18:33] !log swift: add ms-be202[234] weight 1000 - T136630 [16:18:34] T136630: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630 [16:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:27] bblack: so my change was good, but the check was bad. my not waiting for the check to fail has put the repo into a bad state of always validating [16:19:28] ? 
[16:19:41] (due to my forcing it through rather than wait) [16:19:50] argh, i never force it through and just did today ;_; [16:20:42] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [16:21:03] So what is the fix (i imagine someone is already doing it now, and i dont intend to skip validation again anytime soon cuz this.) [16:21:10] but wanna know the fix anyhow =] [16:21:15] I'm trying to figure out a fix, regexes are hards [16:21:27] sorry to break stuff =[ [16:22:00] in particular sorry to break stuff and then force you to deal with (they arent really very) regular expressions. [16:24:13] 06Operations, 10Gerrit, 06Release-Engineering-Team, 06WMF-Legal, and 2 others: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#2402519 (10chasemp) 05Open>03Resolved there were still some files older than 90 there `-rw-r----- 1 root adm 39441687 Feb 2... [16:26:29] (03PS2) 10Gehel: Remove old maps-test servers from LVS config [puppet] - 10https://gerrit.wikimedia.org/r/295640 [16:26:59] where does the operations-puppet-typos check get defined? 
[16:27:18] I'm trying to figure out if whatever regex engine is going to support negative lookbehind or not [16:27:36] (03CR) 10jenkins-bot: [V: 04-1] Remove old maps-test servers from LVS config [puppet] - 10https://gerrit.wikimedia.org/r/295640 (owner: 10Gehel) [16:27:54] !log remove old log files on ytterbium for T114395 [16:27:55] T114395: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395 [16:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:12] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [16:29:55] well, worst case I break the typo check and either it still -1's everything, or it fails to check some typos and someone else has to fix it later [16:31:24] (03PS1) 10BBlack: exclude WMFNNNN.$dcname.wmnet from hostname typos [puppet] - 10https://gerrit.wikimedia.org/r/295727 [16:32:01] I wonder if updates to the typos file applied to jenkins check of the same change [16:32:47] (03CR) 10BBlack: [C: 032] exclude WMFNNNN.$dcname.wmnet from hostname typos [puppet] - 10https://gerrit.wikimedia.org/r/295727 (owner: 10BBlack) [16:32:58] apparently they do! 
[16:33:13] so either I broke the NNNN.$dcname typo checks completely, or I fixed them to exclude WMF, one of the two :) [16:33:30] (03PS2) 10BBlack: r::c::perf: move all commentary inline [puppet] - 10https://gerrit.wikimedia.org/r/295722 [16:33:32] (03PS2) 10BBlack: r::c::perf: enable tcp metrics saving [puppet] - 10https://gerrit.wikimedia.org/r/295723 [16:33:34] (03PS2) 10BBlack: cache roles: add tcpmhash_entries=64K to kernel cmdline [puppet] - 10https://gerrit.wikimedia.org/r/295724 [16:35:26] (03CR) 10BBlack: [C: 032] r::c::perf: move all commentary inline [puppet] - 10https://gerrit.wikimedia.org/r/295722 (owner: 10BBlack) [16:35:41] (03CR) 10jenkins-bot: [V: 04-1] cache roles: add tcpmhash_entries=64K to kernel cmdline [puppet] - 10https://gerrit.wikimedia.org/r/295724 (owner: 10BBlack) [16:36:00] I think that -1 is probably legitimate :) [16:36:36] (03PS3) 10BBlack: cache roles: add tcpmhash_entries=64K to kernel cmdline [puppet] - 10https://gerrit.wikimedia.org/r/295724 [16:44:35] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2402581 (10Dzahn) 05Open>03Resolved a:03Dzahn If it's on Done on a board, the status should also be resolved, right? [16:45:15] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, thanks Nicko for taking care of this!" [puppet] - 10https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422) (owner: 10Nicko) [16:46:38] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2402597 (10Gehel) The usage in Discovery is to move tasks to Done on board and let our product owner have a final review and closing them. 
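The typos-check exchange above (a rule that expects four-digit host numbers in eqiad to start with 1, and a fix that exempts WMFNNNN asset-tag names via negative lookbehind) can be sketched roughly as follows. This is only an illustration under stated assumptions: the two patterns and the helper name `is_flagged` are hypothetical and are not the contents of the real `typos` file or of change 295727.

```python
import re

# Hypothetical "before" rule: flag any four-digit number in front of
# ".eqiad." that does not start with 1 (eqiad host numbers are 1xxx).
NAIVE_RULE = re.compile(r"(?<![0-9])[02-9][0-9]{3}\.eqiad\.")

# Hypothetical "after" rule: the same check, but a negative lookbehind
# skips numbers directly preceded by "WMF" (asset-tag style names).
FIXED_RULE = re.compile(r"(?<!WMF)(?<![0-9])[02-9][0-9]{3}\.eqiad\.")


def is_flagged(rule, line):
    """Return True if the rule reports a hostname typo on this line."""
    return rule.search(line) is not None


if __name__ == "__main__":
    samples = [
        "fixed-address WMF4724.eqiad.wmnet;",  # asset tag, should pass
        "fixed-address mw2206.eqiad.wmnet;",   # 2xxx number in eqiad
        "fixed-address mw1304.eqiad.wmnet;",   # well-formed eqiad host
    ]
    for line in samples:
        print(line, is_flagged(NAIVE_RULE, line), is_flagged(FIXED_RULE, line))
```

Python's `re` module supports fixed-width lookbehind, which is all this sketch needs; whether the engine behind the Jenkins job does is exactly the open question raised in the channel.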
[16:46:53] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:47:45] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2402598 (10Dzahn) 05Resolved>03Open [16:47:45] jynus: I am currently fighting a bit with mw1304 (another jobrunner) that had problems with boot, now puppet is running.. Let me know if you see issues like before [16:48:06] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2278919 (10Dzahn) a:05Dzahn>03None [16:48:09] (03PS4) 10EBernhardson: logstash: Update filters for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) [16:48:40] elukey, busy with something else, maybe someone else can help you, if not ,please wait some time [16:49:08] yes sure! I meant to tell you that I am working on mw1304, that's it :) [16:51:12] 06Operations, 10DBA, 10Wikimedia-Etherpad: etherpad database issues - https://phabricator.wikimedia.org/T138516#2402605 (10jcrespo) [16:53:06] 06Operations, 10DBA, 10Wikimedia-Etherpad, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2402631 (10jcrespo) [16:54:45] (03PS5) 10EBernhardson: logstash: Update filters for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) [16:55:29] 06Operations, 10ops-eqiad, 06DC-Ops: mw1302.eqiad.wmnet issues while booting - https://phabricator.wikimedia.org/T138485#2402641 (10elukey) Actually now, mw1304 looks weird only from the console, I managed to run puppet using install-console on palladium.. 
[16:59:52] (03CR) 10BBlack: [C: 032] "Manual testing on one node shows very small loadavg increase, so this seems un-dangerous to turn on and watch perf graphs for the next few" [puppet] - 10https://gerrit.wikimedia.org/r/295723 (owner: 10BBlack) [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160623T1700). [17:00:25] (03CR) 10BBlack: [C: 032] "Compiler output looks ok, but can't see deeply through augeaus results.." [puppet] - 10https://gerrit.wikimedia.org/r/295724 (owner: 10BBlack) [17:03:00] 06Operations, 10DBA, 10Wikimedia-Etherpad, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2402704 (10jcrespo) p:05Triage>03High [17:03:41] !log cache perf tuning marker: start rollout of tcp_no_metrics_save:0 [17:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:51] (03CR) 10JanZerebecki: [C: 031] Log PHP/HHVM errors in CLI mode to stderr, not stdout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295554 (https://phabricator.wikimedia.org/T138291) (owner: 10Hoo man) [17:26:45] cmjohnson: hi! do you have a minute? [17:26:57] Hi elukey [17:26:58] sure [17:27:06] I see your tasks about the apaches [17:27:19] app servers [17:27:39] yeah, mw1304 is a bit weird.. I am following the puppet run from palladium since I've used wmf-reimage, but I can't access the server console [17:27:43] seems stuck somewhere [17:28:05] and before that I tried to powercycle thinking that it was a boot problem [17:28:38] but same issue (stuck somewhere while booting, not output/errors) [17:28:45] can you double check? 
[17:28:58] maybe it is me missing something really trivial [17:29:21] when i plug in...it gives me the os prompt [17:29:30] mw1304 login: [17:29:56] !log labmon1001 cpy changed back to local usb, errors on network transfer for ownership. resumed rsync with append flag to local usb disk. [17:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:12] PROBLEM - HP RAID on ms-be2022 is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. [17:30:43] i think the serial console is not set correctly [17:32:30] will deploy new version of parsoid shortly ... [17:35:42] cmjohnson: all right, that kinda makes sense, after your "mw1304 login:" I felt a bit frustrated :P [17:36:41] (afk for ~30 mins) [17:36:51] elukey: fixed [17:37:01] !log starting parsoid deploy [17:37:04] Debian GNU/Linux 8 mw1304 ttyS1 [17:37:04] mw1304 login: [17:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:15] cmjohnson: thanks!!!! [17:37:22] what was the issue? [17:37:32] I mean, can I recognize it in the future and fix it by myself? [17:38:21] no, it's a setting that I got wrong when I initially racked them [17:40:23] (03PS2) 10Gehel: LABS: Enable graphoid geoshapes [puppet] - 10https://gerrit.wikimedia.org/r/295581 (https://phabricator.wikimedia.org/T138192) (owner: 10Yurik) [17:40:51] !log synced new code; restarted parsoid on wtp1001 as a canary [17:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:22] lgtm. restarting on all nodes [17:44:17] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, and 2 others: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2286195 (10ksmith) Thanks to everyone who helped get this unstuck and fixed! [17:44:23] (03CR) 10Gehel: "This is only deployment-prep configuration. Can be merge as-is. Conversation continues for the prod part." 
[puppet] - 10https://gerrit.wikimedia.org/r/295581 (https://phabricator.wikimedia.org/T138192) (owner: 10Yurik) [17:44:45] (03CR) 10Gehel: [C: 032] LABS: Enable graphoid geoshapes [puppet] - 10https://gerrit.wikimedia.org/r/295581 (https://phabricator.wikimedia.org/T138192) (owner: 10Yurik) [17:45:24] !log finished deploying parsoid sha 18022c96 [17:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:43] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:43] 06Operations, 06Community-Liaisons, 10Wikimedia-Mailing-lists: mailman maint window 2016-06-xx 16:00 - 18:00 UTC - https://phabricator.wikimedia.org/T138228#2402815 (10Aklapper) [17:53:24] 06Operations, 13Patch-For-Review: Staging area for the next version of the transparency report - https://phabricator.wikimedia.org/T138197#2402822 (10Aklapper) In reply to T138197#2395473: See task summary: "semi-private staging area" [17:55:12] 06Operations, 10DBA, 10Wikimedia-Etherpad, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2402828 (10jcrespo) [17:55:55] cmjohnson: got it thanks! 
[17:57:44] (03PS4) 10Jdlrobson: Complete list of legacy main pages, switch default to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425) [18:02:37] (03PS1) 10Elukey: Add mw1304 to the MW scap DSH list [puppet] - 10https://gerrit.wikimedia.org/r/295740 [18:04:19] (03CR) 10Elukey: [C: 032] Add mw1304 to the MW scap DSH list [puppet] - 10https://gerrit.wikimedia.org/r/295740 (owner: 10Elukey) [18:06:48] (03PS6) 10EBernhardson: logstash: Update filters for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) [18:09:47] !log labmon1001 powering down for reimage [18:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:07] cmjohnson: Ok, its all on you once labmon1001 powers down. Please set aside the old disks in order, in case I messed up and we have to fall back to them. [18:10:29] then lemme know when the new ones are in and ready to go, and i can reimage it and restore data. [18:11:29] (03CR) 10Ori.livneh: [C: 031] Restore mc1007 memcached growth factor to 1.05 as the rest of the cluster. [puppet] - 10https://gerrit.wikimedia.org/r/295702 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [18:12:00] PROBLEM - HP RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. [18:12:10] PROBLEM - HP RAID on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. 
[18:14:06] (03PS7) 10EBernhardson: logstash: Update filters for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) [18:14:08] (03PS6) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442 [18:15:21] (03CR) 10jenkins-bot: [V: 04-1] logstash: Update filters for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) (owner: 10EBernhardson) [18:15:21] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:15:25] (03CR) 10jenkins-bot: [V: 04-1] Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442 (owner: 10EBernhardson) [18:17:21] (03PS8) 10EBernhardson: logstash: Update filters for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) [18:17:23] (03PS7) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442 [18:18:15] (03PS2) 10Muehlenhoff: Update debdeploy config for maps caches [puppet] - 10https://gerrit.wikimedia.org/r/295211 [18:20:53] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update debdeploy config for maps caches [puppet] - 10https://gerrit.wikimedia.org/r/295211 (owner: 10Muehlenhoff) [18:26:25] (03PS9) 10EBernhardson: logstash: Update logstash for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) [18:27:30] (03PS1) 10Urbanecm: [cleanup] Delete old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295744 [18:31:44] !log mw130[0134] - new jobrunners installed and pooled (happened automatically after the first puppet run) [18:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:33:53] ok just 
finished mw1304, seems to work fine [18:35:12] going afk, just logged --^ a summary of the new jobrunners [18:35:33] I hoped to have some explicit pool action, but it seems embedded in puppet [18:35:41] anyhow, logs looks good [18:35:57] let me know if anything weird comes up during the next hours! [18:49:37] 06Operations, 06Discovery, 10Kartotherian, 06Maps: Maps - enable Geoshapes on production - https://phabricator.wikimedia.org/T138525#2402939 (10Gehel) [18:49:59] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2402952 (10Gehel) [18:50:01] 06Operations, 06Discovery, 10Kartotherian, 06Maps: Maps - enable Geoshapes on production - https://phabricator.wikimedia.org/T138525#2402951 (10Gehel) [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160623T1900). [19:01:23] hold your horses. Holding train for the moment while some patches are deployed. [19:01:54] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2400728 (10scfc) I once thought of a tool that does something like `diff -u <(mysqldump --no-data) <(what-views-and-triggers-and... [19:05:20] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2402968 (10jcrespo) @scfc redactatron is a horrible piece of software and we do not want to expand it, but kill it. It has its f... 
[19:21:47] !log Synced patches for T137288 and T137593 [19:23:53] delete the leading space :) [19:24:20] !log 19:21 < RoanKatto> !log Synced patches for T137288 and T137593 [19:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:59] (03PS1) 10Thcipriani: all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295747 [19:27:40] (03CR) 10Thcipriani: [C: 032] all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295747 (owner: 10Thcipriani) [19:28:18] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295747 (owner: 10Thcipriani) [19:29:01] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.7 [19:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:20] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:55:47] Abuse filter does not seem happy after rolling forward :\ https://phabricator.wikimedia.org/T138529 + https://phabricator.wikimedia.org/T138528 [19:56:14] I guess it is some rule on enwiki which ends up triggering the flow of notices [19:56:19] they are probably easy fix [19:59:42] thcipriani: mind if I deploy a config change? 
[19:59:59] jzerebecki: go ahead [20:00:21] (03CR) 10JanZerebecki: [C: 032] Log PHP/HHVM errors in CLI mode to stderr, not stdout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295554 (https://phabricator.wikimedia.org/T138291) (owner: 10Hoo man) [20:01:40] (03PS2) 10JanZerebecki: Log PHP/HHVM errors in CLI mode to stderr, not stdout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295554 (https://phabricator.wikimedia.org/T138291) (owner: 10Hoo man) [20:01:48] (03CR) 10JanZerebecki: [C: 032] Log PHP/HHVM errors in CLI mode to stderr, not stdout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295554 (https://phabricator.wikimedia.org/T138291) (owner: 10Hoo man) [20:02:28] (03Merged) 10jenkins-bot: Log PHP/HHVM errors in CLI mode to stderr, not stdout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295554 (https://phabricator.wikimedia.org/T138291) (owner: 10Hoo man) [20:03:03] !log labmon1001 data restore at 100gb 50minutes in, 298gb total for restoration [20:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:29] !log jzerebecki@tin Synchronized wmf-config/CommonSettings.php: Log PHP/HHVM errors in CLI mode to stderr, not stdout T138291 (duration: 00m 28s) [20:04:30] T138291: Latest wikidata JSON dump contains unexpected sql warning - https://phabricator.wikimedia.org/T138291 [20:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:06] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2403090 (10AlexMonk-WMF) >>! In T138450#2401133, @jcrespo wrote: > I have to add a view to a newly created labs-only table, so i... [20:05:11] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[20:09:36] (03CR) 10Dzahn: [C: 031] Add Amiri font to the scalers [puppet] - 10https://gerrit.wikimedia.org/r/295498 (https://phabricator.wikimedia.org/T135347) (owner: 10Muehlenhoff) [20:12:02] done [20:17:02] !log Run initSiteStats.php on cebwiki (T138533) [20:17:03] T138533: Update statistics count on cebwiki - https://phabricator.wikimedia.org/T138533 [20:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:43] (03PS1) 10Alex Monk: Replace impossible watchlist_counts custom view with full view of already-filtered watchlist_count [software] - 10https://gerrit.wikimedia.org/r/295751 [20:19:30] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:19:48] (03CR) 10EBernhardson: "test deployed to beta cluster, looks to be working with no warnings/errors." [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/295575 (https://phabricator.wikimedia.org/T138335) (owner: 10EBernhardson) [20:20:13] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2403157 (10AlexMonk-WMF) @jcrespo: I was wrong in my last comment and have uploaded https://gerrit.wikimedia.org/r/295751 which,... [20:21:25] (03CR) 10BryanDavis: [C: 032] Add de_dot filter and rename to logstash-filters-wikimedia [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/295575 (https://phabricator.wikimedia.org/T138335) (owner: 10EBernhardson) [20:21:29] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [20:21:49] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [20:21:55] ebernhardson: Do you want to deploy that now or should we wait for the filters that use it? [20:24:37] bd808: hmm, might as well wait i suppose. 
I think i'll cherry pick the patch back to master though, the first patch for making deployment-logstash3 work might not ever even need proper merging, just shutdown the host and remove the patch from deployment-puppetmaster [20:24:45] s/master/production/ [20:24:58] the puppet also seems to be working, but still testing things [20:26:26] works for me. we should try not to forget that a trebuchet deploy is needed before the de_dot filter can be used in prod [20:26:43] ahh thats right, i guess i'll make it easy and just sync it out now without restarting logstash [20:27:11] *nod* that should be safe [20:28:06] bd808: no jenkins on that repo btw, needs v+2 and merge [20:28:23] jynus: re T137058 – is it simply a matter of getting around the production/labs split by storing the data in analytics instead and having that be the data store? [20:28:24] T137058: Investigation: MediaWiki extension for database reports - https://phabricator.wikimedia.org/T137058 [20:28:27] doh. I can do that [20:28:52] (03CR) 10BryanDavis: [V: 032] Add de_dot filter and rename to logstash-filters-wikimedia [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/295575 (https://phabricator.wikimedia.org/T138335) (owner: 10EBernhardson) [20:29:27] 06Operations, 06Discovery, 10Kartotherian, 06Maps: Maps - enable Geoshapes on production - https://phabricator.wikimedia.org/T138525#2403184 (10Yurik) [20:31:04] !log synced out latest logstash-plugins via trebuchet [20:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:32:00] 06Operations, 10DBA, 10Wikimedia-Etherpad, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2403185 (10jcrespo) So good news: we have been able to recover until just a few minutes before crashing (which means virtually no data loss). The problem is we have yet to reimpor... [20:32:47] harej, what do you mean with analytics? 
[20:33:19] "Another option is to send labs data to a specialized analytics store, where creating reports on the fly would be much easier and faster." [20:34:06] While having a MediaWiki extension that pulls directly from the production DB is unacceptable, and likewise pulling from the Labs replicas is unacceptable, it sounds like putting the data in analytics is acceptable? [20:35:18] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2403187 (10AlexMonk-WMF) >>! In T138450#2402468, @chasemp wrote: > * For [[ https://phabricator.wikimedia.org/T135029#2400629 |... [20:36:28] 06Operations, 10DBA, 10Wikimedia-Etherpad, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2403189 (10Effeietsanders) Can you please make sure to not overwrite the things added later? I re-did a bunch of the work I did this afternoon in preperation of the discussions tom... [20:40:31] 06Operations, 10DBA, 10Wikimedia-Etherpad, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2403193 (10jcrespo) @Effeietsanders as I sent on my email- no data will be added, deleted or overwritten on the current etherpad. **I promised that and I will maintain that.** We t... [20:42:21] 06Operations, 10DBA, 10Wikimedia-Etherpad, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2403195 (10jcrespo) Clarification: no data will be added, deleted or overwritten on the current etherpad **by us (operators)**, you are expected to do that as usual (use the curren... 
[20:42:26] (03PS10) 10EBernhardson: logstash: Update logstash for sending to es 2.x [puppet] - 10https://gerrit.wikimedia.org/r/295578 (https://phabricator.wikimedia.org/T138335) [20:47:30] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:47:57] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2403204 (10jcrespo) @scfc BTW, I actually documented [[ https://wikitech.wikimedia.org/wiki/MariaDB/Sanitarium_and_Labsdbs | red... [20:53:00] 06Operations, 10DBA, 10Wikimedia-Etherpad, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2403209 (10jcrespo) @akosiaris I managed to reimport the tables, with two different timestamps. They are on the same host (m1-master), and I have granted permission to the same use... [21:08:53] PROBLEM - Disk space on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/project/maps 4161047 MB (0% inode=-)] [21:09:21] jynus: ok after some mucking around I have a second instance using the restore_2 DB and using a different port [21:09:36] lemme make it a bit more permanent and fix the rest [21:10:56] maybe try it first with ssh? [21:11:27] I have like a 90% confidence on 1 and a 50% on 2 [21:11:56] !log silence alerts for labstore1004 for setup [21:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:02] akosiaris, but if you make it work, with all the beers I own joe and chris, and now you I will get broke! 
[21:16:42] there's a pretty serious save-timing regression that kicks off around 19:20-ish [21:16:54] PROBLEM - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [21:18:11] it's either sync-wikiversions at circa 19:20, or maybe the sync-dir at 19:15, which I guess might be 19:21 < RoanKattouw> !log Synced patches for T137288 and T137593 [21:18:29] I thought we used to get these echo'd to -ops? (the sync traffic) [21:18:29] restbase math server === mathoid ? [21:19:58] I guess they are, but the stamps in grafana don't match the stamps in -ops. perhaps one's at the start and the other at the end [21:19:59] (03PS1) 10Alexandros Kosiaris: cache::misc: Set up a temporary etherpad host [puppet] - 10https://gerrit.wikimedia.org/r/295757 (https://phabricator.wikimedia.org/T138516) [21:20:27] bblack: I'd appreciate a review of ^ [21:20:36] bblack: from the look of it, it's probably related to the wikiversion sync [21:20:40] anyways, still, it's probably RK's "synced patches" or 19:29 < logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.7 [21:21:37] (03CR) 10BBlack: [C: 031] cache::misc: Set up a temporary etherpad host [puppet] - 10https://gerrit.wikimedia.org/r/295757 (https://phabricator.wikimedia.org/T138516) (owner: 10Alexandros Kosiaris) [21:21:51] I'm going to guess that the move to wmf.7 had a pretty big impact. I'll rollback. [21:22:01] bblack: thanks! 
[21:22:04] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [21:22:04] ACKNOWLEDGEMENT - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js alexandros kosiaris See T138516 as to why there are currently 2 instances [21:22:15] the save timing regression is ~ +30%, it's pretty bad [21:22:47] jynus: I owe you beers over this as well so we are going to get even ;-) [21:22:59] ori, ^ [21:23:00] (03PS2) 10BBlack: stream.wm.o: move to cache_misc in DNS [dns] - 10https://gerrit.wikimedia.org/r/295385 (https://phabricator.wikimedia.org/T134871) [21:23:09] 07Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2403326 (10ori) >>! In T138450#2403187, @AlexMonk-WMF wrote: >>>! In T138450#2402468, @chasemp wrote: >> * For [[ https://phabri... [21:23:10] (03CR) 10Alexandros Kosiaris: [C: 032] cache::misc: Set up a temporary etherpad host [puppet] - 10https://gerrit.wikimedia.org/r/295757 (https://phabricator.wikimedia.org/T138516) (owner: 10Alexandros Kosiaris) [21:23:14] (03PS2) 10Alexandros Kosiaris: cache::misc: Set up a temporary etherpad host [puppet] - 10https://gerrit.wikimedia.org/r/295757 (https://phabricator.wikimedia.org/T138516) [21:23:25] MaxSem: thanks for the ping, catching up with backlog now. [21:23:29] (03CR) 10Alexandros Kosiaris: [V: 032] cache::misc: Set up a temporary etherpad host [puppet] - 10https://gerrit.wikimedia.org/r/295757 (https://phabricator.wikimedia.org/T138516) (owner: 10Alexandros Kosiaris) [21:23:41] what's the tl;dr? bad regression, coincides with wmf7 release? 
[21:23:48] ori: yes [21:23:49] ty ori [21:23:57] but there's some other minor changes around that time, too [21:24:16] I only see the big hit in savetiming, not other metrics that I looked at so far [21:24:27] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 wikis to wmf.6 [21:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:48] that other ones aren't typically sensitive to backend response time, since the majority of requests are served from varnish [21:24:53] seems that way, I just rolled back group2 wikis that went out today [21:25:04] thanks [21:25:11] the hourly flame graphs are usually useful (https://performance.wikimedia.org/xenon/svgs/hourly/) [21:26:12] (03PS1) 10Thcipriani: Revert "all wikis to 1.28.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295758 [21:26:56] bblack, ori: Sorry, my laptop died, only seeing this now [21:26:57] (03CR) 10Thcipriani: [C: 032] Revert "all wikis to 1.28.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295758 (owner: 10Thcipriani) [21:27:07] The "patches" I synced were security patches, see the bugs I tagged [21:27:24] I don't think offhand that they should be able to cause save time regressions but let me skim them [21:27:34] (03Merged) 10jenkins-bot: Revert "all wikis to 1.28.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295758 (owner: 10Thcipriani) [21:28:01] yeah, judging from https://grafana.wikimedia.org/dashboard/db/save-timing there is a pretty strong correlation with sync-wikiversions i.e. 
wmf.7 [21:28:07] other thing I'm doing is looking at xenon-grep [21:28:08] doubt it was the security patches [21:28:19] https://dpaste.de/PR23/raw [21:28:23] Nope, they are not at all related to saving [21:29:08] actually -2: is better [21:29:56] https://dpaste.de/aHpY/raw [21:29:59] ApiStashEdit::checkCache looks suspect [21:30:14] 1.19% -> 8.33 [21:30:36] Aaron's made some changes to that recently [21:31:50] no diff b/w php-1.28.0-wmf.[67]/includes/api/ApiStashEdit.php , but calling code could have changed [21:32:05] that doesn't show very prominently in https://performance.wikimedia.org/xenon/svgs/hourly/2016-06-23_20.index.reversed.svgz (0.4%) [21:32:23] (03PS2) 10Gehel: (WIP) Notify TileratorUI on new expiry files [puppet] - 10https://gerrit.wikimedia.org/r/295450 (https://phabricator.wikimedia.org/T108459) (owner: 10Yurik) [21:33:18] (03PS1) 10Alexandros Kosiaris: Set up etherpad-restore.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/295760 (https://phabricator.wikimedia.org/T138516) [21:34:20] (03CR) 10Alexandros Kosiaris: [C: 032] Set up etherpad-restore.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/295760 (https://phabricator.wikimedia.org/T138516) (owner: 10Alexandros Kosiaris) [21:36:22] bd808: that's for all index.php reqs, whereas the xenon-grep invocation filters for traces that include EditPage [21:36:35] *nod* [21:37:44] (03CR) 10Gehel: "Already putting this out there for review, but it's too late, I probably missed something obvious." [puppet] - 10https://gerrit.wikimedia.org/r/295450 (https://phabricator.wikimedia.org/T108459) (owner: 10Yurik) [21:39:48] 06Operations, 10DBA, 10Wikimedia-Etherpad, 13Patch-For-Review, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2403399 (10akosiaris) Thanks to @jcrespo 's efforts and using the `etherpadlite_restore2` database, we now have http://etherpad-restore.wikimedia.org. This is... 
[21:41:05] (03CR) 10Yurik: [C: 04-1] "wow, lots of nice improvements :) Made a few minor comments." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295450 (https://phabricator.wikimedia.org/T108459) (owner: 10Yurik) [21:42:19] page save time is dropping back down [21:43:18] (03PS3) 10BBlack: Remove old maps-test servers from LVS config [puppet] - 10https://gerrit.wikimedia.org/r/295640 (owner: 10Gehel) [21:44:10] (03CR) 10BBlack: [C: 031] "Correct, "puppet-merge" will invoke "conftool-merge" to remove the servers from the lists pybal uses, no explicit action on LVSes is requi" [puppet] - 10https://gerrit.wikimedia.org/r/295640 (owner: 10Gehel) [21:44:55] (03CR) 10Gehel: "BBlack: thanks! Will merge tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/295640 (owner: 10Gehel) [21:46:53] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:47:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] "nice!. minor inline comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295450 (https://phabricator.wikimedia.org/T108459) (owner: 10Yurik) [21:48:07] yeah, the edit stash hit ratio track yesterday's until the deployment and then dipped: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1466718428.219&from=-4hours&target=alias(asPercent(sumSeries(MediaWiki.editstash.cache_hits.*.rate)%2C%20sumSeries(MediaWiki.editstash.cache_%7Bhits%2Cmisses%7D.*.rate))%2C%22edit%20stash%20hit%20%25%22)&target=alias(timeShift(asPercent(sumSeries(MediaWiki.editstash.cache_hits.*.r [21:48:07] ate)%2C%20sumSeries(MediaWiki.editstash.cache_%7Bhits%2Cmisses%7D.*.rate))%2C%20%221d%22)%2C%20%22edit%20stash%20hit%20%25%2C%20-1d%22) [21:48:45] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[21:48:51] I'd paste the short URL but I can't because Graphite is broken ("Graphite encountered an unexpected error while handling your request." in the top frame, and short urls point to 127.0.0.1) [21:48:54] anyone know what that's about? [21:49:34] * ori guesses yuvi / I22f45a80e834ab1a686fe09c8ce64da19380dbaa [21:50:44] RECOVERY - etherpad_lite_process_running on etherpad1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [21:58:26] 07Blocked-on-Operations, 06Operations, 07Graphite: "unexpected error" on graphite-web - https://phabricator.wikimedia.org/T138541#2403434 (10ori) [21:58:51] hmm, so the suspicion is https://gerrit.wikimedia.org/r/#/c/295023/1 ? [21:59:10] thcipriani: no, that one is on both 7 and 6 [21:59:34] PROBLEM - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [22:01:13] it might not be edit stash related at all, since page save time is back down, but the stash hit rate is not [22:01:36] (03PS5) 10Gehel: Move es-tool to a proper python package [puppet] - 10https://gerrit.wikimedia.org/r/290765 [22:02:50] (03CR) 10jenkins-bot: [V: 04-1] Move es-tool to a proper python package [puppet] - 10https://gerrit.wikimedia.org/r/290765 (owner: 10Gehel) [22:03:27] thcipriani: I doubt I'll get to the bottom of it in the few minutes I have before I need to go. Is it all right to leave wikis on wmf.6 for now? [22:04:06] ori: yes. probably for the best if the solution is unknown and the save time is returning to normal since the rollback. [22:05:29] abusefilter logspam was almost to a rollback tipping point anyway. 
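The Graphite expression quoted in the edit stash discussion takes cache hits as a percentage of hits plus misses, and overlays the same series shifted back one day for comparison. A minimal sketch of that arithmetic follows; the function names and the sample rates are hypothetical and are not taken from the log or from Graphite itself.

```python
def hit_percent(hits: float, misses: float) -> float:
    """Return hits as a percentage of all lookups (hits + misses)."""
    total = hits + misses
    if total == 0:
        # No lookups at all: report 0 rather than dividing by zero.
        return 0.0
    return 100.0 * hits / total


def hit_percent_delta(today: tuple, yesterday: tuple) -> float:
    """Difference in hit percentage (percentage points) between two samples.

    Each sample is a (hits, misses) pair; this mirrors comparing the live
    series against the one-day time-shifted series by eye.
    """
    return hit_percent(*today) - hit_percent(*yesterday)


# Illustrative numbers only (assumed, not measured):
# hit_percent(30, 10) is 75.0 and hit_percent(20, 20) is 50.0,
# so the delta below is a drop of 25 percentage points.
drop = hit_percent_delta(today=(20, 20), yesterday=(30, 10))
```

A negative delta that appears right at a deployment timestamp is the kind of dip described above; as noted in the channel, it does not by itself show that the stash caused the save-time regression.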
[22:09:51] (03PS1) 10Alexandros Kosiaris: Introduce etherpad100b.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/295763 (https://phabricator.wikimedia.org/T138516) [22:10:15] (03CR) 10jenkins-bot: [V: 04-1] Introduce etherpad100b.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/295763 (https://phabricator.wikimedia.org/T138516) (owner: 10Alexandros Kosiaris) [22:10:54] ignore that [22:11:50] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "jenkins seems to have croaked. Removing and self merging" [dns] - 10https://gerrit.wikimedia.org/r/295763 (https://phabricator.wikimedia.org/T138516) (owner: 10Alexandros Kosiaris) [22:12:03] (03PS1) 10Alexandros Kosiaris: etherpad-restore: Use etherpad1001b [puppet] - 10https://gerrit.wikimedia.org/r/295764 (https://phabricator.wikimedia.org/T138516) [22:12:32] !log powercycle labstore1005 [22:12:32] (03PS2) 10Alexandros Kosiaris: etherpad-restore: Use etherpad1001b [puppet] - 10https://gerrit.wikimedia.org/r/295764 (https://phabricator.wikimedia.org/T138516) [22:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:12:50] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] etherpad-restore: Use etherpad1001b [puppet] - 10https://gerrit.wikimedia.org/r/295764 (https://phabricator.wikimedia.org/T138516) (owner: 10Alexandros Kosiaris) [22:12:53] OK -- I have to go. If no one else files a task, I will file one when I get back in a couple of hours. Thanks for spotting that and rolling back, and thanks for the ping. 
[22:13:02] ^ thcipriani [22:13:24] ack [22:13:37] thanks for looking into it [22:15:01] 06Operations, 10Traffic: Backend naming in VCL needs to use fqdn+port - https://phabricator.wikimedia.org/T138546#2403529 (10BBlack) [22:15:54] PROBLEM - carbon-cache@b service on labmon1001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is inactive [22:16:14] PROBLEM - carbon-cache@c service on labmon1001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is inactive [22:16:25] PROBLEM - carbon-cache@d service on labmon1001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is inactive [22:16:44] PROBLEM - carbon-cache@e service on labmon1001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive [22:17:03] PROBLEM - carbon-cache@f service on labmon1001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is inactive [22:17:13] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: Connection refused [22:17:15] PROBLEM - carbon-cache@g service on labmon1001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@g is inactive [22:17:16] robh: I guess not silenced^? [22:17:29] ahhh, for the old window, then removed from puppet [22:17:31] lemme fix [22:17:43] PROBLEM - carbon-cache@h service on labmon1001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is inactive [22:17:44] PROBLEM - carbon-cache@a service on labmon1001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@a is inactive [22:18:53] ok, they are in maint mode until tomorrow 1700gmt [22:19:17] so no more irc spam. 
the issue is the maint isnt sticky when you reinstall a host and remove it from icinga [22:19:18] heh [22:22:37] (03PS6) 10Alex Monk: [WIP/POC/POS] Add python version of maintain-replicas script [software] - 10https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) [22:22:44] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [22:33:07] !log reimage labstore1005 post io testing [22:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:33:34] RECOVERY - etherpad_lite_process_running on etherpad1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [22:34:14] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:40:24] PROBLEM - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [22:44:49] 06Operations, 10DBA, 10Wikimedia-Etherpad, 13Patch-For-Review, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2403642 (10jcrespo) It is finally working: https://etherpad-restore.wikimedia.org (if it does not, wait for your DNS cache to update). Please recover anythin... [22:46:16] 06Operations, 10DBA, 10Wikimedia-Etherpad, 13Patch-For-Review, 07User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2403643 (10jcrespo) [22:52:14] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, awight, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160623T2300). Please do the needful. 
[23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:16] here /o [23:02:52] (03CR) 10MaxSem: [C: 04-1] Complete list of legacy main pages, switch default to false (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425) (owner: 10Jdlrobson) [23:03:39] MaxSem: why sorted? It's actually more useful to group by project [23:04:05] it's not sorted even within one project. also, other dblists are sorted [23:05:13] okay.. but is there any reason other than readability (just so i understand motivation)? [23:06:26] readability is a pretty important one [23:06:46] I don't think it's technically required [23:07:19] i'm just thinking how best to sort it [23:07:39] cat | sort > file [23:07:41] it would be preferable to have projects sorted by languages but not sure how easy that would be to achieve... [23:07:51] coherence with other dblist is a pretty good idea too [23:07:57] (i know i can sort it that way... 
but that leads to all the zh's together) [23:08:08] which makes it harder to divide and conquer this list one community at a time [23:08:36] that's how all the other lists are sorted [23:08:58] jdlrobson: the dblist is not a good management todo list tool [23:09:23] (03PS5) 10Jdlrobson: Complete list of legacy main pages, switch default to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425) [23:09:26] jdlrobson: grep wikt myaweesome.dblist [23:09:28] i don't care enough so i just sorted :) [23:09:38] new patch is up [23:09:44] (03CR) 10jenkins-bot: [V: 04-1] Complete list of legacy main pages, switch default to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425) (owner: 10Jdlrobson) [23:09:57] (03PS6) 10Jdlrobson: Complete list of legacy main pages, switch default to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425) [23:10:30] wait something went wrong [23:10:42] (03CR) 10Jdlrobson: [C: 04-1] Complete list of legacy main pages, switch default to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425) (owner: 10Jdlrobson) [23:11:17] why -1? 
[23:11:37] i think it's okay [23:11:45] i just read the diff wrong - i thought it had removed some items in the sort [23:11:50] but they seem to be there [23:11:52] then remove it:) [23:11:57] i have [23:12:03] it's okay to merge again :) [23:12:34] (03PS7) 10MaxSem: Complete list of legacy main pages, switch default to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425) (owner: 10Jdlrobson) [23:12:43] (03CR) 10MaxSem: [C: 032] Complete list of legacy main pages, switch default to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425) (owner: 10Jdlrobson) [23:13:29] (03Merged) 10jenkins-bot: Complete list of legacy main pages, switch default to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295600 (https://phabricator.wikimedia.org/T138425) (owner: 10Jdlrobson) [23:14:55] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:15:05] !log maxsem@tin Synchronized dblists/mobilemainpagelegacy.dblist: https://gerrit.wikimedia.org/r/#/c/295600/ (duration: 00m 28s) [23:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:59] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/295600/ (duration: 00m 29s) [23:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:04] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [23:17:15] jdlrobson, ^ [23:17:21] on it [23:25:00] MaxSem: looks good to me (As best as i can test - can't find any examples where it broke things) [23:25:05] thank you! [23:28:12] :) [23:45:05] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: puppet fail
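The dblist ordering question from the SWAT window (a plain sort keeps each language together, while the author wanted entries grouped by project) can be sketched as follows. This is a minimal illustration only: the suffix list, the function names and the sample entries are assumptions, and the merged patch simply used a plain sort to match the other dblists.

```python
# Project suffixes assumed for illustration; "wiki" is checked last so the
# longer suffixes get a chance to match first.
PROJECT_SUFFIXES = (
    "wiktionary", "wikibooks", "wikinews", "wikiquote",
    "wikisource", "wikiversity", "wikivoyage", "wiki",
)


def project_of(dbname: str) -> str:
    """Return the project suffix of a database name, or '' if none matches."""
    for suffix in PROJECT_SUFFIXES:
        if dbname.endswith(suffix):
            return suffix
    return ""


def sort_plain(names):
    """Plain lexical order, equivalent to piping the file through `sort`."""
    return sorted(names)


def sort_by_project(names):
    """Group by project first, then order lexically within each project."""
    return sorted(names, key=lambda name: (project_of(name), name))


sample = ["zhwiktionary", "enwiki", "zhwiki", "enwiktionary"]
plain = sort_plain(sample)         # languages end up adjacent
grouped = sort_by_project(sample)  # projects end up adjacent
```

With the plain order, a per-project subset can still be pulled out afterwards with a filter such as the `grep wikt` suggested in the channel, which is one reason the simpler ordering was considered good enough.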