[00:00:25] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293653 (10Pchelolo) Not sure if that's related, but today we've noticed then when some te... [00:01:00] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 [00:01:53] 06Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3293658 (10Papaul) p:05Normal>03Low [00:14:20] thcipriani: i18n strings take a long time to regenerate / get through caches, right? [00:15:27] ejegg: iirc unless they are ?touched? Its normal caching [00:16:21] gotcha [00:16:21] ejegg: not sure what the caching is like for l10n stuff, should be fairly immediate iirc, the cdb files are being rebuilt on all servers right now and should be done Soon™ :) [00:16:39] !log thcipriani@tin Finished scap: SWAT: [[gerrit:355660|Fix version of DonationInterface deployed to donatewiki]] T166302 (duration: 19m 44s) [00:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:48] T166302: Missing interface messages on donatewiki - https://phabricator.wikimedia.org/T166302 [00:16:51] ^ ejegg new messages should be live everywhere [00:17:43] thcipriani: can i buy Soon(TM) from you xD [00:18:05] Soon™® [00:18:28] ☹️ [00:20:30] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2009 is OK: OK: synced at Fri 2017-05-26 00:20:28 UTC. [00:22:06] ty for the clarification thcipriani|afk [00:22:18] * thcipriani|afk doffs hat [00:37:09] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293693 (10Pchelolo) > Was that due to moves, or due to undeletions? 
According to the com... [00:41:04] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293695 (10kaldari) >Do you know if this was affecting all Wikipedias, or just certain one... [01:12:20] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:13:10] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 78768 bytes in 0.344 second response time [01:29:23] !log bsitzmann@tin Started deploy [mobileapps/deploy@a8d0c91]: Update mobileapps to db6493c [01:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:18] jouncebot: refresh [01:30:20] I refreshed my knowledge about deployments. [01:30:49] bd808: MatmaRex what happened? [01:33:08] !log bsitzmann@tin Finished deploy [mobileapps/deploy@a8d0c91]: Update mobileapps to db6493c (duration: 03m 45s) [01:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:59] addshore: what are you referring to? [01:43:19] addshore: my pings of you were about some bugs SMalyshev found in the core statsd code. [01:43:46] He's got a patch up in gerrit that I haven't looked at yet [01:44:07] addshore: ? 
[01:44:35] i think you wanted to ping someone else [02:25:57] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 08m 31s) [02:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:52:11] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 59s) [02:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:48] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri May 26 02:58:48 UTC 2017 (duration 6m 37s) [02:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:48:59] (03PS1) 10Dzahn: stop using exim4::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/355741 [03:51:31] (03PS2) 10Dzahn: stop using exim4::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/355741 [03:59:10] (03CR) 10Dzahn: [C: 04-1] "the existing metrics in ganglia are kind of hard to find but there, have to click "no_group metrics". https://ganglia.wikimedia.org/latest" [puppet] - 10https://gerrit.wikimedia.org/r/355741 (owner: 10Dzahn) [04:03:33] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#3293783 (10Dzahn) I uploaded a patch to stop using exim4::ganglia plugin when i saw it play a role in the spam-from-labs issue in T166322... 
[04:09:30] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=396.20 Read Requests/Sec=454.40 Write Requests/Sec=0.60 KBytes Read/Sec=37102.00 KBytes_Written/Sec=16.00 [04:17:54] ACKNOWLEDGEMENT - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3024.20 Read Requests/Sec=1290.90 Write Requests/Sec=0.50 KBytes Read/Sec=38226.80 KBytes_Written/Sec=20.80 daniel_zahn these are (read) spikes - needs better threshold or be removed [04:19:30] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=19.20 Read Requests/Sec=0.30 Write Requests/Sec=0.50 KBytes Read/Sec=2.00 KBytes_Written/Sec=19.20 [04:52:25] (03PS1) 10Dzahn: wikistats: install php7.0-xml if on stretch [puppet] - 10https://gerrit.wikimedia.org/r/355742 [05:49:40] 06Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3293855 (10Marostegui) >>! In T166344#3293358, @Volans wrote: > I've ack'ed the Icinga alarm with this task. > > I've also forced a BBU learn cycle on db1016, it was looking good during the cy... [05:50:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [05:50:30] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:51:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:51:12] 06Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3293868 (10Marostegui) It is now showing Optimal again: ``` BatteryType: BBU Voltage: 4074 mV Current: 0 mA Temperature: 32 C Battery State: Optimal BBU Firmware Status: Charging Status... 
[05:51:50] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 33 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [05:52:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [05:56:30] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:56:50] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 10 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [05:58:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:02:03] (03PS1) 10Marostegui: db-codfw.php: Repool db2057, depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355744 (https://phabricator.wikimedia.org/T166278) [06:05:38] !log Resume pt-table-checksum on s1 - T162807 [06:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:49] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [06:06:07] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2057, depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355744 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [06:07:35] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2057, depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355744 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [06:07:44] (03CR) 10jenkins-bot: db-codfw.php: Repool db2057, depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355744 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [06:07:59] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - 
https://phabricator.wikimedia.org/T166345#3293304 (10Joe) So to summarize this succintly, I'll post the list of request on a random... [06:09:15] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2057, depool db2050 - T166278 (duration: 00m 56s) [06:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:23] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [06:10:15] !log Deploy alter table on s3 - db2050 - T166278 [06:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:59] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293894 (10Joe) To be more clear: there is exactly 0% probability this was caused by somet... [06:22:45] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293304 (10Marostegui) I believe this would merit an incident report to find out exactly w... [06:25:00] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [06:25:10] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [06:25:15] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293926 (10Joe) >>! 
In T166345#3293922, @Marostegui wrote: > I believe this would merit an... [06:33:48] (03PS1) 10Marostegui: db-eqiad.php: Repool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355745 (https://phabricator.wikimedia.org/T166206) [06:35:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355745 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [06:37:28] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355745 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [06:37:35] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355745 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [06:38:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097 - T166206 (duration: 00m 40s) [06:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:03] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [06:44:00] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [06:44:10] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [07:07:25] (03CR) 10Faidon Liambotis: [C: 04-1] "This is not the root cause for the issue you mentioned, but we should eventually do it nevertheless. 
I'd like to see something that replac" [puppet] - 10https://gerrit.wikimedia.org/r/355741 (owner: 10Dzahn) [07:07:55] (03Abandoned) 10Faidon Liambotis: cassandra/aqs: drop Hiera values equal to defaults [puppet] - 10https://gerrit.wikimedia.org/r/354081 (owner: 10Faidon Liambotis) [07:10:10] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active [07:18:10] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 61814 MB (12% inode=99%) [07:22:10] RECOVERY - Disk space on elastic1023 is OK: DISK OK [07:28:52] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3293963 (10jcrespo) Thank you, trying to reimage again. [07:34:50] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:35:40] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set [07:37:10] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 41, down: 0, shutdown: 4 [07:39:00] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Transit: NTT (service ID 253065) {#11401} [10Gbps]BR [07:43:00] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [07:45:10] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active, AS2914/IPv4: Active [07:49:10] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 43, down: 0, shutdown: 2 [07:49:21] (03PS1) 10Marostegui: check_private_data_report: Add Jaime's email [puppet] - 10https://gerrit.wikimedia.org/r/355748 [07:53:11] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics 
from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#3293978 (10fgiunchedi) @Dzahn correct there's no replacement yet, the easiest ATM is likely to use `extendedeximcollector` for diamond we... [08:06:53] (03PS2) 10Jcrespo: jynus-vimrc: Disable mouse input & enable syntax highlighting [puppet] - 10https://gerrit.wikimedia.org/r/355595 [08:08:57] (03CR) 10Jcrespo: [C: 032] jynus-vimrc: Disable mouse input & enable syntax highlighting [puppet] - 10https://gerrit.wikimedia.org/r/355595 (owner: 10Jcrespo) [08:09:00] 06Operations, 06DC-Ops: Wipe of spare/replacement disks - https://phabricator.wikimedia.org/T166368#3293994 (10fgiunchedi) [08:10:21] 06Operations, 13Patch-For-Review: provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135#836247 (10fgiunchedi) [08:10:21] 06Operations, 13Patch-For-Review: support big files in atftpd - https://phabricator.wikimedia.org/T87804#3294023 (10fgiunchedi) 05Open>03Resolved I believe this is fixed now, all tftp-related machines are running jessie [08:12:10] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [08:13:00] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [08:13:02] (03PS1) 10Giuseppe Lavagetto: role::jobqueue_redis: add redis instances for changeprop [puppet] - 10https://gerrit.wikimedia.org/r/355751 (https://phabricator.wikimedia.org/T161710) [08:13:41] <_joe_> elukey: ^^ [08:14:11] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:17:08] (03CR) 10Jcrespo: [C: 031] check_private_data_report: Add Jaime's email [puppet] - 
10https://gerrit.wikimedia.org/r/355748 (owner: 10Marostegui) [08:20:01] (03PS1) 10Jcrespo: install_server: Reimage db2049 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/355752 (https://phabricator.wikimedia.org/T165739) [08:21:33] _joe_ checking [08:21:44] <_joe_> elukey: https://puppet-compiler.wmflabs.org/6540/ [08:23:11] (03CR) 10Jcrespo: [C: 032] install_server: Reimage db2049 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/355752 (https://phabricator.wikimedia.org/T165739) (owner: 10Jcrespo) [08:23:21] I was about to pcc it :) [08:24:16] 06Operations, 10Monitoring, 06Operations-Software-Development: Monitoring: create an alert for daemonized puppet - https://phabricator.wikimedia.org/T166371#3294049 (10Volans) [08:24:58] 06Operations: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203#3288272 (10Volans) So it seems that those flapping results are due to puppet running ALSO as a daemon on those hosts (thanks @faidon ), because if at any time when running a puppet agent there is a typo in the options arou... 
[08:25:00] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:25:10] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [08:25:45] (03CR) 10Elukey: [C: 031] role::jobqueue_redis: add redis instances for changeprop [puppet] - 10https://gerrit.wikimedia.org/r/355751 (https://phabricator.wikimedia.org/T161710) (owner: 10Giuseppe Lavagetto) [08:26:00] also checked https://puppet-compiler.wmflabs.org/6542/rdb1005.eqiad.wmnet/ just to be super sure, everything looks good [08:26:00] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 8 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:26:08] 06Operations, 10Monitoring, 06Operations-Software-Development: Monitoring: create an alert for daemonized puppet - https://phabricator.wikimedia.org/T166371#3294049 (10Volans) [08:31:41] (03PS1) 10Faidon Liambotis: check_cpufreq: various cleanups (typos, shell etc.) [puppet] - 10https://gerrit.wikimedia.org/r/355753 [08:35:24] 06Operations, 10Monitoring, 06Operations-Software-Development: Monitoring: create an alert for daemonized puppet - https://phabricator.wikimedia.org/T166371#3294070 (10faidon) a:03Dzahn [08:35:35] (03CR) 10Faidon Liambotis: [C: 032] check_cpufreq: various cleanups (typos, shell etc.) 
[puppet] - 10https://gerrit.wikimedia.org/r/355753 (owner: 10Faidon Liambotis) [08:40:04] <_joe_> paravoid: I think the OK/CRITICAL/UNKNOWN there are redundant btw [08:40:21] hm, may be [08:40:35] * _joe_ off for now [08:45:30] !log killed daemonized puppet on tegmen, lvs1006 T166203 [08:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:40] T166203: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203 [08:46:03] 06Operations: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203#3294073 (10Volans) 05Open>03Resolved [08:50:23] (03PS6) 10Filippo Giunchedi: prometheus: report puppet agent stats [puppet] - 10https://gerrit.wikimedia.org/r/354007 [08:50:25] (03PS5) 10Filippo Giunchedi: base: report prometheus agent stats [puppet] - 10https://gerrit.wikimedia.org/r/354457 [08:50:27] (03PS5) 10Filippo Giunchedi: prometheus: add alertmanager_url to prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/354459 [08:50:29] (03PS5) 10Filippo Giunchedi: role: use alertmanager in beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/354460 [08:50:31] (03PS4) 10Filippo Giunchedi: role: set external url for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/354975 [08:50:33] (03PS5) 10Filippo Giunchedi: WIP prometheus::alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/354976 [08:51:28] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/354007 (owner: 10Filippo Giunchedi) [08:52:18] (03CR) 10jerkins-bot: [V: 04-1] WIP prometheus::alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/354976 (owner: 10Filippo Giunchedi) [08:54:23] 06Operations: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372#3294083 (10Volans) [08:57:52] (03PS2) 10Marostegui: check_private_data_report: Add Jaime's email [puppet] - 10https://gerrit.wikimedia.org/r/355748 [08:59:26] (03CR) 10Marostegui: [C: 032] 
check_private_data_report: Add Jaime's email [puppet] - 10https://gerrit.wikimedia.org/r/355748 (owner: 10Marostegui) [09:05:00] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [09:11:00] RECOVERY - Host ripe-atlas-eqiad is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [09:28:33] 06Operations: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372#3294117 (10Volans) The command to run this across the fleet (skipping the hosts currently down) is: ``` sudo cumin -m async -d -b 4 'F:facterversion = "2.4.6" and not mw2140*' \ 'disable-puppet "Testing... [09:30:02] !log slowly testing if puppet stringify_facts=false is a noop across the fleet T166372 [09:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:10] T166372: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372 [09:43:04] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 634.32 seconds [09:43:10] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "need to address the nutcracker issue." 
[puppet] - 10https://gerrit.wikimedia.org/r/355751 (https://phabricator.wikimedia.org/T161710) (owner: 10Giuseppe Lavagetto) [09:44:04] (03CR) 10Jcrespo: [C: 031] Revert "Revert "mariadb: allow reimage of db2048 for upgrade to jessie"" [puppet] - 10https://gerrit.wikimedia.org/r/354448 (owner: 10Jcrespo) [09:44:08] (03PS2) 10Jcrespo: Revert "Revert "mariadb: allow reimage of db2048 for upgrade to jessie"" [puppet] - 10https://gerrit.wikimedia.org/r/354448 [09:44:21] (03CR) 10Jcrespo: [C: 032] Revert "Revert "mariadb: allow reimage of db2048 for upgrade to jessie"" [puppet] - 10https://gerrit.wikimedia.org/r/354448 (owner: 10Jcrespo) [09:44:47] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3294144 (10elukey) Moritz built hhvm_3.18.2+dfsg-1+wmf4+exp1_amd64.deb with a patch fo... [09:47:28] (03PS2) 10Giuseppe Lavagetto: role::jobqueue_redis: add redis instances for changeprop [puppet] - 10https://gerrit.wikimedia.org/r/355751 (https://phabricator.wikimedia.org/T161710) [09:51:47] 06Operations, 10ops-codfw, 13Patch-For-Review: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3294146 (10jcrespo) 05Open>03Resolved I do not know why, but this worked as intended cc @MoritzMuehlenhoff and fixed the jessie installation iss... [09:52:25] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1949 bytes in 0.322 second response time [10:03:49] Hi, is there anybody with production querying access? I have noticed a very weird problem. An article displays a category at the bottom of the page but the category isn't in the article anymore (via [[Category:Cat]]). [10:04:21] I'd like to know if it is really in categorylinks. Maybe you have another idea...
[10:05:27] 06Operations, 10ops-eqiad, 06DC-Ops, 06Services: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3294198 (10fgiunchedi) >>! In T166181#3292320, @Eevans wrote: > * Finally, if there were somewhere on the network with sufficient disk space, the existing data files could b... [10:08:23] (03CR) 10Giuseppe Lavagetto: [C: 032] "Added a note to the jobrunner hiera file where redis::shards is referenced." [puppet] - 10https://gerrit.wikimedia.org/r/355751 (https://phabricator.wikimedia.org/T161710) (owner: 10Giuseppe Lavagetto) [10:08:45] 06Operations, 06Analytics-Kanban, 10DBA: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294201 (10elukey) Hello @Marostegui, thanks a lot for the heads up! I checked `megacli -AdpBbuCmd -a0` again and this is the status: ``` BBU Capacity Info for Adapter: 0 Relative State of Charg... [10:09:02] 06Operations, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294202 (10elukey) [10:09:13] 06Operations, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286435 (10elukey) a:03elukey [10:10:18] (03PS3) 10Giuseppe Lavagetto: role::jobqueue_redis: add redis instances for changeprop [puppet] - 10https://gerrit.wikimedia.org/r/355751 (https://phabricator.wikimedia.org/T161710) [10:10:23] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::jobqueue_redis: add redis instances for changeprop [puppet] - 10https://gerrit.wikimedia.org/r/355751 (https://phabricator.wikimedia.org/T161710) (owner: 10Giuseppe Lavagetto) [10:10:50] 06Operations, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294205 (10Marostegui) Hello, If you are planning to keep that host for a long time (which I assume so) - I would definitely replace the BBU. I think @Cmjohnson might have spares fr... 
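[Editor's aside: the BBU checks discussed above (`megacli -AdpBbuCmd -a0` on db1016 and db1046) emit plain `Key: Value` text; a minimal sketch of screening that output for a non-Optimal battery state. The helper names are hypothetical, not part of any WMF tooling; the field names are taken from the output quoted earlier in this log.]

```python
def parse_megacli_bbu(output: str) -> dict:
    """Parse `megacli -AdpBbuCmd -a0`-style `Key: Value` lines into a dict."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info

def bbu_is_healthy(info: dict) -> bool:
    # Anything other than "Optimal" (e.g. mid learn-cycle, as with db1016
    # above) is worth a second look.
    return info.get("Battery State") == "Optimal"

# Sample taken from the db1016 output quoted earlier in the channel.
sample = """BatteryType: BBU
Voltage: 4074 mV
Current: 0 mA
Temperature: 32 C
Battery State: Optimal"""

print(bbu_is_healthy(parse_megacli_bbu(sample)))  # → True
```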
[10:12:16] <_joe_> jynus: can I merge your change? [10:12:45] 06Operations, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286435 (10jcrespo) Everything you say is correct. We are decommissioning many... 06Operations, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294211 (10elukey) Yes let's replace the BBU, will wait for a confirmation from @Cmjohnson then! [10:13:01] _joe_: yes [10:13:05] I was about to do it [10:13:36] thank you [10:13:36] <_joe_> jynus: merged [10:13:38] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294212 (10Marostegui) [10:14:36] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3294214 (10Gilles) There is an apparent performance improvement that coincides in timing, but on a simulated slow internet connection: {T166373}... [10:14:53] <_joe_> ok something didn't work in my change [10:15:10] <_joe_> but nothing too serious [10:19:04] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0] [10:21:04] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 37.93% of data above the critical threshold [1800.0] [10:22:08] (03PS1) 10Giuseppe Lavagetto: profile::redis::multidc_instace: fix template [puppet] - 10https://gerrit.wikimedia.org/r/355756 [10:23:06] are the above alarms related to the wikidata one? [10:23:15] <_joe_> I guess so? [10:23:22] I have no idea about these [10:23:41] should we ping somebody?
[10:23:44] PROBLEM - Check health of redis instance on 6382 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1495794215 600 - REDIS 2.8.17 on 127.0.0.1:6382 has 0 databases (), up 9 minutes 48 seconds - replication_delay is 1495794215 [10:23:52] <_joe_> that's ok ^^ [10:24:03] 06Operations, 10MediaWiki-General-or-Unknown, 06Security-Team, 10Traffic: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3294216 (10fgiunchedi) @Anomie yeah I don't know either what would happen in that case, I think we can turn it into a 400 for now... [10:24:09] welcome 6382 \o/ [10:25:25] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0] [10:26:00] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::redis::multidc_instace: fix template [puppet] - 10https://gerrit.wikimedia.org/r/355756 (owner: 10Giuseppe Lavagetto) [10:26:50] * gehel is having a look at wdqs [10:27:01] <_joe_> gehel: what does that alarm mean btw? 
[10:27:44] RECOVERY - Check health of redis instance on 6382 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6382 has 0 databases (), up 13 minutes 48 seconds [10:28:19] there is MariaDB Slave Lag: s5 on dbstore1002, which is the slowest of databases [10:28:32] so there must be high write activity on s5 (wikidata) [10:28:54] aka a bot doing lots of edits, which has happened many times in the past [10:29:21] <_joe_> so that causes replication lag [10:29:24] PROBLEM - High lag on wdqs2002 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0] [10:29:25] <_joe_> ok that's what I thought [10:29:58] databases, except the one I mention [10:30:10] are slowed down to make things not lag (at least ideally) [10:30:16] by mediawiki [10:30:46] but other components not in mediawiki don't necessarily match database speed (and that is despite being too slow in my opinion) [10:31:14] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0] [10:31:24] PROBLEM - High lag on wdqs2001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1800.0] [10:32:22] we now have 3500 edits per minute, many of those are probably wikidata based on the recentchanges [10:32:56] we normally do 1000-1500 [10:33:01] but among all wikis [10:33:21] https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&orgId=1&from=now-7d&to=now [10:33:24] PROBLEM - High lag on wdqs2003 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0]
https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&panelId=12&fullscreen&orgId=1 [10:35:19] <_joe_> oh indeed something is going on [10:35:20] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&panelId=12&fullscreen&orgId=1&from=now-30d&to=now [10:35:24] PROBLEM - High lag on wdqs2003 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0] [10:35:27] <_joe_> does the timing coincide? [10:35:29] no [10:35:34] that is much before [10:35:50] <_joe_> yeah [10:35:52] wikidata is a one-off stress spike, but this is much longer term [10:35:54] <_joe_> I'll take a look [10:36:18] I saw it some days ago, but as it later went back to baseline, I forgot [10:36:33] but apparently it started growing again [10:37:54] <_joe_> htmlCacheUpdate: 998738 queued; on wikidata [10:38:02] <_joe_> so that's at least part of the problem [10:38:07] yeah [10:38:11] <_joe_> jynus: do you already know which bot it is? [10:38:20] <_joe_> we might throttle it [10:38:20] there are several users [10:38:29] CC0-JS [10:39:29] I think it is the most active right now [10:39:31] https://www.wikidata.org/w/index.php?title=Special:Contributions/CC0-JS&offset=&limit=500&target=CC0-JS [10:39:51] more than the limits we say are reasonable in the API etiquette [10:40:12] _joe_: like jynus said, probably related to high writes on wikidata [10:40:54] ~2200 edits per minute [10:41:06] more than double the edits we have on all wikis at the same time [10:41:57] https://grafana-admin.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1 [10:42:16] oh, that is nice [10:42:30] let's show it for the folks that do not have an account [10:42:36] https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1 [10:42:39] yes, sorry :) [10:42:57] <_joe_> Amir1: around? [10:43:02] <_joe_> we need to block this bot [10:43:08] _joe_: yeah [10:43:15] which bot?
[10:43:21] <_joe_> https://www.wikidata.org/w/index.php?title=Special:Contributions/CC0-JS&offset=&limit=500&target=CC0-JS [10:43:21] though in this case, it looks like the wdqs updater is not only lagging, it is actually stuck... [10:43:35] <_joe_> gehel: ok, the jobqueue issue is very real [10:43:41] yeah, I have a ticket about wikidata [10:43:58] not being able to take the throughput mediawiki takes [10:44:11] _joe_: done [10:44:15] jynus: you mean wdqs? [10:44:17] <_joe_> Amir1: thanks [10:44:25] I made a note on the talk page [10:44:33] both wdqs and non-mediawiki wikidata backends in general [10:44:38] <_joe_> yup, I can reinforce the concepts [10:44:50] like extension only and other stuff they have [10:44:58] jynus: ok [10:45:29] How was mwscript run to find htmlCacheUpdate: 998738 _joe_? [10:45:30] It has dropped after Amir1's block: https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1 [10:45:59] (I am trying to collect quick commands for the job queues) [10:46:16] although there were other bots contributing at the same time, too, I think [10:46:43] on the wdqs side, it does not look like a throughput issue, but a stability issue [10:46:45] jynus: qq - how did you find that CC0-JS was the most active bot?
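On the question above about how mwscript was run: per-type figures like `htmlCacheUpdate: 998738 queued` typically come from MediaWiki's `showJobs.php` maintenance script (e.g. `mwscript showJobs.php --wiki=wikidatawiki --group`). A sketch of pulling the worst backlog out of output in that shape — the sample lines and their exact format are assumptions for illustration, apart from the htmlCacheUpdate figure, which is the one quoted above:

```python
import re

# Output in the rough style of `showJobs.php --group`; the numbers are
# invented apart from htmlCacheUpdate, which matches the quoted backlog.
sample = """\
htmlCacheUpdate: 998738 queued; 120 claimed (30 active, 90 abandoned); 0 delayed
refreshLinks: 15210 queued; 40 claimed (40 active, 0 abandoned); 0 delayed
cirrusSearchLinksUpdate: 820 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
"""

def queue_sizes(output):
    """Map each job type to its queued count from showJobs-style output."""
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r"^(\w+): (\d+) queued", output, re.MULTILINE)}

sizes = queue_sizes(sample)
worst = max(sizes, key=sizes.get)
print(worst, sizes[worst])  # htmlCacheUpdate 998738
```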
[10:46:59] I literally went to recentchanges, show bots [10:47:05] <_joe_> https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&orgId=1&from=now-3h&to=now wow, quite a drop :P [10:47:23] I think my phase as wikimedian help me a bit :-) [10:47:27] *helped [10:47:57] PROBLEM - MariaDB Slave SQL: s2 on db2049 is CRITICAL: CRITICAL slave_sql_state could not connect [10:47:57] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1512.80 seconds [10:47:57] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 21330.66 seconds [10:47:57] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:47:59] <_joe_> elukey: I usually trust the api logs to get the sneaky ones that don't edit [10:48:16] PROBLEM - DPKG on analytics1030 is CRITICAL: Return code of 255 is out of bounds [10:48:16] PROBLEM - Hadoop DataNode on analytics1030 is CRITICAL: Return code of 255 is out of bounds [10:48:17] PROBLEM - configured eth on analytics1030 is CRITICAL: Return code of 255 is out of bounds [10:48:17] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1540.92 seconds [10:48:22] PROBLEM - mysqld processes on db2049 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [10:48:22] PROBLEM - MariaDB Slave IO: s2 on db2049 is CRITICAL: CRITICAL slave_io_state could not connect [10:48:22] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 30774.86 seconds [10:48:22] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1547.88 seconds [10:48:26] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag could not connect [10:48:34] <_joe_> marostegui: is that expected? 
^^ [10:48:36] PROBLEM - MariaDB Slave Lag: s4 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 54040.61 seconds [10:48:37] PROBLEM - Disk space on graphite1002 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=98%) [10:48:38] ok, several things here [10:48:43] coming back from downtime? [10:48:44] db2049 is down [10:48:46] PROBLEM - Disk space on Hadoop worker on analytics1030 is CRITICAL: Return code of 255 is out of bounds [10:48:47] and can be ignored [10:48:48] PROBLEM - dhclient process on analytics1030 is CRITICAL: Return code of 255 is out of bounds [10:48:48] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: Return code of 255 is out of bounds [10:48:48] <_joe_> oh ok [10:48:49] PROBLEM - Disk space on analytics1030 is CRITICAL: Return code of 255 is out of bounds [10:48:51] but the others? [10:48:55] analytics1030 too [10:49:01] elastic2020 too [10:49:04] jynus: I downtimed them for a week (alters on s4) [10:49:05] downtime lost? [10:49:10] seems so [10:49:11] icinga issue? [10:49:12] :/ [10:49:13] again [10:49:24] <_joe_> heh [10:49:31] I think I will fix it by brute force, or my heart will collapse [10:49:35] <_joe_> I'm pretty sure icinga was reloaded [10:49:36] PROBLEM - HP RAID on elastic2020 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, Controller, Battery/Capacitor - Failed: 1I:1:2 [10:50:03] <_joe_> after my changes [10:50:07] I have just reimaged analytics1030 and it was downtimed, not sure if it is remotely possible that this has dropped all the downtimes [10:50:25] just mentioning for the records :) [10:50:25] marostegui: the s4 hosts are executing a schema change, right? [10:50:26] PROBLEM - HP RAID on ms-be1037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[10:50:26] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1946 bytes in 0.130 second response time [10:50:33] jynus: yes [10:50:38] elukey: [10:50:38] I downtimed them for a week yesterday [10:50:46] <_joe_> I repeat, icinga was reloaded because of my puppet changes [10:50:50] I am reviewing my terminal buffer, and there it is [10:50:53] ah [10:50:55] https://phabricator.wikimedia.org/T164206 [10:51:13] but shouldn't downtime be stored on disk? [10:51:23] _joe_: did your change change the hosts list? [10:51:36] PROBLEM - HP RAID on ms-be1032 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:51:46] PROBLEM - HP RAID on ms-be1029 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:51:53] oh, that would be nice if it paired with that [10:51:56] PROBLEM - HP RAID on ms-be1038 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:52:01] I'll take care of the hp raid [10:52:16] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: Return code of 255 is out of bounds [10:52:47] I'll ack 1030 in a bit [10:52:50] If there is anything I can help with, ping me [10:52:58] <_joe_> so that bot was responsible for 80% of edits currently [10:53:10] on all wikis, right?
[10:53:36] RECOVERY - Disk space on graphite1002 is OK: DISK OK [10:53:44] that would maybe fit another issue I saw [10:54:04] if it is cache invalidation jobs [10:54:21] https://phabricator.wikimedia.org/T164173 [10:54:53] <_joe_> jynus: yes [10:54:56] because it happened on several wikis, we theorized it could be some shared component [10:55:02] like commons or wikidata [10:55:21] for the record, wdqs issue tracked at T166378 [10:55:22] T166378: wdqs-updater fails when tail-poller queue is full - https://phabricator.wikimedia.org/T166378 [10:55:56] PROBLEM - MegaRAID on analytics1030 is CRITICAL: Return code of 255 is out of bounds [10:56:29] !log restart wdqs-updater on all wdqs nodes - T166378 [10:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:18] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3294263 (10jcrespo) Happened again today at 10:47. [11:00:45] 06Operations: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372#3294264 (10Volans) First diff found on `scb1004`: ``` (1) scb1004.eqiad.wmnet ----- OUTPUT ----- --- /dev/fd/63 2017-05-26 10:51:40.147441209 +0000 +++ /dev/fd/62 2017-05-26 10:51:40.147441209 +0000 @...
[11:00:56] PROBLEM - salt-minion processes on analytics1030 is CRITICAL: Return code of 255 is out of bounds [11:01:56] RECOVERY - HP RAID on ms-be1038 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [11:03:35] jynus: i have downtimed db2049 again for a week [11:06:01] thanks [11:06:47] RECOVERY - dhclient process on analytics1030 is OK: PROCS OK: 0 processes with command name dhclient [11:06:56] RECOVERY - Disk space on analytics1030 is OK: DISK OK [11:06:56] RECOVERY - salt-minion processes on analytics1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:07:16] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1030 is OK: OK ferm input default policy is set [11:07:16] RECOVERY - DPKG on analytics1030 is OK: All packages OK [11:07:16] RECOVERY - configured eth on analytics1030 is OK: OK - interfaces up [11:08:36] PROBLEM - NTP on analytics1030 is CRITICAL: NTP CRITICAL: No response from NTP server [11:09:56] I can handle the other hosts, but do not want to touch them (set as down) without confirmation [11:10:04] analytics1030 [11:10:14] and elastic2020 [11:10:30] I think volans said they are ok to ack? 
[11:10:51] jynus: analytics1030 was just reimaged by elukey so I'd leave it to him [11:11:00] yes [11:11:06] elastic2020 is T149006 [11:11:06] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [11:11:16] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: CRITICAL: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: LOST [11:11:18] ok, I will ack and point to that task [11:11:20] thanks [11:11:33] thank you [11:11:46] no, thank you for pointing that out [11:11:46] RECOVERY - Disk space on Hadoop worker on analytics1030 is OK: DISK OK [11:12:01] an1030 should be good in ~30 mins [11:12:04] completing the work [11:12:21] that is cool, it is all icinga's fault [11:12:34] or our puppetization of the restarts [11:12:47] yeah! [11:13:28] it would be nice to know what exactly joe did, he seemed quite confident it was him restarting icinga [11:13:35] so there may be a trigger there [11:13:47] like you said, only certain changes [11:15:57] RECOVERY - MegaRAID on analytics1030 is OK: OK: optimal, 13 logical, 14 physical [11:21:31] RECOVERY - HP RAID on ms-be1032 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [11:24:47] jynus: hey, regarding https://www.wikidata.org/wiki/User_talk:CC0-JS#Block I didn't want to make this into a WMF vs. community crusade.
That's why I didn't mention WMF at all [11:24:59] just for the record [11:25:23] yes, but I didn't want to blame you alone on the decision [11:26:09] thanks :) [11:30:31] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1931 bytes in 0.175 second response time [11:34:41] 06Operations: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372#3294288 (10Volans) It seems expected to me, it is used through `$::processorcount` across different modules in puppet. And the reported diff is only in the parameters of the class. [11:49:55] (03PS1) 10Giuseppe Lavagetto: jobrunner: bump up the number of htmlCacheUpdate jobs temporarily [puppet] - 10https://gerrit.wikimedia.org/r/355760 [11:50:21] RECOVERY - HP RAID on ms-be1037 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [11:50:23] <_joe_> elukey, jynus when you're back [11:50:35] <_joe_> this patch ^^ should help consume the queue [12:01:21] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:03:21] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 35.14 seconds [12:05:12] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:11:51] RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [12:12:21] RECOVERY - Hadoop DataNode on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [12:21:01] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [12:23:19] _joe_ sure makes sense [12:24:01] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The
system is operational but one or more units failed. [12:31:51] RECOVERY - HP RAID on ms-be1029 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [12:35:02] ACKNOWLEDGEMENT - High lag on wdqs1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] Gehel updater is catching up after high edit rate [12:35:02] ACKNOWLEDGEMENT - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] Gehel updater is catching up after high edit rate [12:35:03] ACKNOWLEDGEMENT - High lag on wdqs1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] Gehel updater is catching up after high edit rate [12:35:03] ACKNOWLEDGEMENT - High lag on wdqs2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] Gehel updater is catching up after high edit rate [12:35:04] ACKNOWLEDGEMENT - High lag on wdqs2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] Gehel updater is catching up after high edit rate [12:35:05] ACKNOWLEDGEMENT - High lag on wdqs2003 is CRITICAL: CRITICAL: 96.67% of data above the critical threshold [1800.0] Gehel updater is catching up after high edit rate [12:39:52] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355764 (https://phabricator.wikimedia.org/T166206) [12:41:42] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355764 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [12:42:21] 06Operations, 10ops-eqiad, 15User-Elukey: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529#3294377 (10elukey) 05Open>03Resolved All good, host reimaged! Thanks! 
[12:42:50] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355764 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [12:43:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355764 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [12:43:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 - T166206 (duration: 00m 41s) [12:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:08] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [12:44:25] !log Restart Hadoop daemons on analytics100[12] (Hadoop master nodes) for jvm upgrades [12:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:21] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3294387 (10Gilles) @bblack one way to verify that the performance improvement we're seeing is "real" would be to turn BBR off for a bit. That bei... 
[12:54:26] RECOVERY - mysqld processes on db2049 is OK: PROCS OK: 1 process with command name mysqld [12:55:01] !log Deploy alter table s4 on db1097 - T166206 [12:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:10] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [12:55:21] RECOVERY - MariaDB Slave IO: s2 on db2049 is OK: OK slave_io_state Slave_IO_Running: Yes [12:55:51] RECOVERY - MariaDB Slave SQL: s2 on db2049 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:56:41] RECOVERY - High lag on wdqs2003 is OK: OK: Less than 30.00% above the threshold [600.0] [12:57:26] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294405 (10Ottomata) How soon is likely to happen? Early next FY or later? If within Q1, I'd say let's just wait and replace the box. Otherwise, let's fix the BBU. Eh? [12:59:45] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294408 (10Marostegui) It is probably worth saying that the BBU might have been broken for a long time. We noticed because of the new check, but it would be too much of... [13:00:50] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294409 (10jcrespo) I agree with Manuel. while I would like to do the replacement ASAP, in reality it is not going to happen until Q2 or later. [13:00:53] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294410 (10Ottomata) > for a long time still. Agree but how long! It is slated for replacement next FY year sometime, right? Maybe we can just do it sooner rather th... 
[13:02:14] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294411 (10jcrespo) The reasoning is that labsdb has priority, and it is even on the best interest of analytics to to that first, if I understood correctly CC @Nuria [13:06:17] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294413 (10elukey) @Ottomata if Chris finds a BBU that among the spare parts that we have I'd say that we can do it asap, it should be a relatively painless downtime fo... [13:08:42] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294415 (10Marostegui) Another tip, once it is replaced (if it is) try to monitor its temperature once it boots up - in the last few weeks during some server moves we n... [13:09:05] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 10DBA, 15User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294419 (10Ottomata) +1 [13:09:41] RECOVERY - High lag on wdqs2002 is OK: OK: Less than 30.00% above the threshold [600.0] [13:11:41] RECOVERY - High lag on wdqs2001 is OK: OK: Less than 30.00% above the threshold [600.0] [13:17:12] (03PS1) 10Filippo Giunchedi: phabricator: redirect serveraliases homepage to phab_servername [puppet] - 10https://gerrit.wikimedia.org/r/355769 (https://phabricator.wikimedia.org/T166120) [13:18:04] 06Operations, 10Phabricator, 10Traffic, 13Patch-For-Review: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3294429 (10fgiunchedi) @mmodell ok thanks! I gave it a try with https://gerrit.wikimedia.org/r/355769 for all server aliases, if that's too broad we can p... 
[13:18:24] (03CR) 10jerkins-bot: [V: 04-1] phabricator: redirect serveraliases homepage to phab_servername [puppet] - 10https://gerrit.wikimedia.org/r/355769 (https://phabricator.wikimedia.org/T166120) (owner: 10Filippo Giunchedi) [13:19:21] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [13:19:56] !log restart wdqs-updater on all wdqs nodes - T166378 [13:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:05] T166378: wdqs-updater fails when tail-poller queue is full - https://phabricator.wikimedia.org/T166378 [13:20:28] 06Operations, 10Phabricator, 10Traffic, 13Patch-For-Review, 15User-fgiunchedi: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3294447 (10fgiunchedi) [13:20:39] 06Operations, 10MediaWiki-General-or-Unknown, 06Security-Team, 10Traffic, and 2 others: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3294448 (10fgiunchedi) [13:25:14] <_joe_> elukey: around? [13:25:33] (03PS2) 10Filippo Giunchedi: phabricator: redirect serveraliases homepage to phab_servername [puppet] - 10https://gerrit.wikimedia.org/r/355769 (https://phabricator.wikimedia.org/T166120) [13:25:56] _joe_ yep! [13:26:14] <_joe_> elukey: I'll merge the htmlCacheUpdate bump, fyi [13:26:28] +1 from my side [13:28:00] apparently, T164173 is not directly related [13:28:00] T164173: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173 [13:28:25] because Aaron said that HTMLCacheUpdateJob does not create that, but other kind of jobs [13:33:39] 06Operations, 10DBA, 06Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3294468 (10jcrespo) While contention is bad in general- it is the opposite of lag- more contention would create less la... 
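Bumping htmlCacheUpdate runner counts, as in _joe_'s patch above, only helps while runner concurrency is the actual bottleneck; once each batch has to wait for replicas to catch up (the lag problem that started this thread), extra workers give diminishing returns. A toy model of that effect — every number here is invented for illustration, not measured:

```python
def drain_time(jobs, workers, job_secs=0.01, lag_per_worker=0.25, batch=500):
    """Toy model: seconds to drain `jobs` with `workers` runners.

    Each runner processes a batch (batch * job_secs seconds), then waits
    for replicas to catch up; that wait grows with the combined write
    rate, modelled crudely as lag_per_worker * workers.
    """
    batches = jobs / (workers * batch)
    return batches * (batch * job_secs + lag_per_worker * workers)

backlog = 998_738  # the htmlCacheUpdate backlog quoted earlier
for w in (5, 10, 20, 40):
    print(w, round(drain_time(backlog, w)))
```

Under these made-up constants, going from 5 to 10 workers cuts the drain time by ~40%, but going from 20 to 40 cuts it by only ~25%: the lag wait, not worker count, dominates.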
[13:34:01] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:21] PROBLEM - configured eth on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:31] PROBLEM - MD RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:31] PROBLEM - SSH on ms-be1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:51] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set [13:35:11] RECOVERY - configured eth on ms-be1020 is OK: OK - interfaces up [13:35:23] RECOVERY - MD RAID on ms-be1020 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [13:35:24] RECOVERY - SSH on ms-be1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [13:37:11] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: bump up the number of htmlCacheUpdate jobs temporarily [puppet] - 10https://gerrit.wikimedia.org/r/355760 (owner: 10Giuseppe Lavagetto) [13:41:33] error rates on job queries have increased [13:42:12] anything specific? [13:42:20] RecentChange::save [13:42:21] and [13:42:29] no [13:42:33] sorry, that is not related [13:42:41] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [13:42:53] CategoryMembershipChangeJob::run I think [13:42:54] <_joe_> jynus: I'm still running puppet around [13:43:08] <_joe_> and that is not one of the jobs I bumped [13:43:30] I wasn't necessarily blaming that [13:43:34] just saying [13:45:08] those happen from time to time, too [13:49:05] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3294489 (10thcipriani) >>! In T166345#3293922, @Marostegui wrote: > I believe this would m...
[13:52:22] (03CR) 10Mark Bergsma: "Splitting up is good, but could this be done without all the white space changes at the same time?" [debs/pybal] - 10https://gerrit.wikimedia.org/r/355609 (owner: 10Giuseppe Lavagetto) [13:52:46] <_joe_> mark: not really if we want to pass pep8 :P [13:53:11] 1) I don't particularly care about pep8, and 2) it can be done separately later :) [13:53:26] <_joe_> well it's in our CI [13:53:36] <_joe_> anyways, that's obviously easy to do [13:53:50] some white space changes were weird btw [13:53:58] e.g. where a dict was used as a case statement [13:54:34] <_joe_> I'll get to it once I tamed the jobqueue [13:54:42] hehe thanks [13:55:25] <_joe_> as usual, adding workers doesn't obviously speed things up [13:55:47] <_joe_> so I'm going to launch jobs manually from terbium (aka: the medieval option) [13:56:12] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0] [13:56:22] <_joe_> but I guess "smart scheduling" is not something one can ask of a 200 lines php script, heh [13:57:42] <_joe_> !log consuming the backlog of htmlCacheUpdate jobs for enwiktionary [13:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:18] (03PS1) 10Marostegui: db-codfw.php: Repool db2050, depool db2043 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355779 (https://phabricator.wikimedia.org/T166278) [14:00:47] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2050, depool db2043 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355779 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [14:02:07] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2050, depool db2043 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355779 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [14:02:19] (03CR) 10jenkins-bot: db-codfw.php: Repool db2050, depool db2043 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355779 (https://phabricator.wikimedia.org/T166278) 
(owner: 10Marostegui) [14:03:06] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2050, depool db2043 - T166278 (duration: 00m 41s) [14:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:16] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [14:04:34] !log Deploy alter table on s3 revision table db2043 - T166278 [14:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:39] (03PS1) 10Ottomata: [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 [14:15:42] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (owner: 10Ottomata) [14:18:55] <_joe_> mark: ahah I just noticed, the flake8 ci job excludes the "pybal" directory [14:18:59] <_joe_> no shit it passes :P [14:19:09] heh [14:19:16] (03PS2) 10Ottomata: [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 [14:19:47] (03PS3) 10Ottomata: [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [14:21:18] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [14:21:26] (03PS4) 10Ottomata: [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [14:21:50] <_joe_> and indeed, even removing things that are clearly part of your style, we got 233 violations in the code tree [14:21:55] the real problem is that using twisted it will never pass pep8 [14:22:01] they use pyflakes [14:22:10] <_joe_> volans: you can exclude things, you know [14:22:16] (03PS5) 10Ottomata: [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [14:22:22] yeah, once you exclude half of things [14:22:29]
then dunno if it makes sense ;) [14:22:52] <_joe_> volans: a few useful things come out of flake8 besides cosmetics [14:23:04] <_joe_> unused imports, unbound/unused variables, etc [14:23:23] I'm sure, I'm always all in for pep8, you know ;) [14:23:24] <_joe_> and no, actually there is nothing twisted-related in the output I see from tox after my changes [14:23:38] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [14:24:37] (03PS1) 10Jcrespo: mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 [14:24:58] (03PS6) 10Ottomata: [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [14:25:17] (03CR) 10Marostegui: [C: 031] "+100000" [puppet] - 10https://gerrit.wikimedia.org/r/355783 (owner: 10Jcrespo) [14:25:42] (03PS2) 10Jcrespo: mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 [14:26:38] (03PS7) 10Ottomata: [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [14:27:14] (03PS8) 10Ottomata: [WIP] Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) [14:27:47] (03CR) 10jerkins-bot: [V: 04-1] mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 (owner: 10Jcrespo) [14:29:14] (03PS3) 10Jcrespo: mysql-client: Install colordiff (neodymium & sarin) [puppet] - 10https://gerrit.wikimedia.org/r/355783 [14:29:16] !log Stop pt-table-checksum on s1 - T162807 [14:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:26] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [14:35:17] (03CR) 10Ottomata: "No-op so far on restbase nodes:"
[puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [14:38:45] (03PS2) 10Giuseppe Lavagetto: Split up pybal.pybal in smaller files [debs/pybal] - 10https://gerrit.wikimedia.org/r/355609 [14:38:47] (03PS1) 10Giuseppe Lavagetto: Create flake8 rules that make sense in our context [debs/pybal] - 10https://gerrit.wikimedia.org/r/355784 [14:39:17] (03CR) 10Giuseppe Lavagetto: "> Splitting up is good, but could this be done without all the white" [debs/pybal] - 10https://gerrit.wikimedia.org/r/355609 (owner: 10Giuseppe Lavagetto) [14:40:15] (03CR) 10jerkins-bot: [V: 04-1] Create flake8 rules that make sense in our context [debs/pybal] - 10https://gerrit.wikimedia.org/r/355784 (owner: 10Giuseppe Lavagetto) [14:43:16] <_joe_> that second patch is intended to fail btw :P [14:43:26] <_joe_> I'm not sure if I'll do that in main or in 2.0-dev [14:46:42] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3294553 (10jcrespo) [14:47:17] (03CR) 10Paladox: [C: 04-1] "I've tested this. I've found that this has to be way up near the top. I've tested it above where it defines the path it loads the site fro" [puppet] - 10https://gerrit.wikimedia.org/r/355769 (https://phabricator.wikimedia.org/T166120) (owner: 10Filippo Giunchedi) [14:48:17] Just needs tweaking ^^ then it can be merged :) [14:51:01] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [14:59:23] (03PS1) 10Cmjohnson: Adding mgmt dns entries for stat1005 and stat1006 T165366 T165368 [dns] - 10https://gerrit.wikimedia.org/r/355785 [15:00:12] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for stat1005 and stat1006 T165366 T165368 [dns] - 10https://gerrit.wikimedia.org/r/355785 (owner: 10Cmjohnson) [15:00:42] nicee! 
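_joe_'s follow-up patch ("Create flake8 rules that make sense in our context") would normally be expressed as a `[flake8]` section in `tox.ini` or `setup.cfg`. A hedged sketch of the general shape — the specific ignore codes and excluded paths are illustrative guesses, not the actual pybal change; only the `ignore`/`exclude`/`max-line-length` option names are standard flake8 configuration:

```ini
[flake8]
# Style checks consciously opted out of (codes here are examples only)
ignore = E129, W503
max-line-length = 100
# Exclude only generated/vendored code -- not the package under test,
# which is how the old config accidentally skipped the pybal/ tree.
exclude = .tox,.git,build,dist
```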
[15:09:42] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3294616 (10cscott) There was a wikitext preprocessor patch in wmf.2 that I'm responsible f... [15:10:28] (03PS1) 10Cmjohnson: Adding mgmt dns entries for dumpsdata1001/2 T165368 [dns] - 10https://gerrit.wikimedia.org/r/355786 [15:10:31] (03CR) 10Mark Bergsma: [C: 032] Split up pybal.pybal in smaller files [debs/pybal] - 10https://gerrit.wikimedia.org/r/355609 (owner: 10Giuseppe Lavagetto) [15:11:31] (03Merged) 10jenkins-bot: Split up pybal.pybal in smaller files [debs/pybal] - 10https://gerrit.wikimedia.org/r/355609 (owner: 10Giuseppe Lavagetto) [15:11:39] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for dumpsdata1001/2 T165368 [dns] - 10https://gerrit.wikimedia.org/r/355786 (owner: 10Cmjohnson) [15:12:39] _joe_: [15:12:40] File "/Users/mark/Documents/workspace/pybal/pybal/test/test_ipvs.py", line 9, in <module> [15:12:40] import pybal.ipvs [15:12:40] File "/Users/mark/Documents/workspace/pybal/pybal/ipvs.py", line 7, in <module> [15:12:40] from . import util [15:12:41] ImportError: cannot import name util [15:13:22] <_joe_> mark: uhm how is it the test passes on CI? [15:13:33] not sure [15:13:35] <_joe_> and yeah, I'll fix that [15:13:45] coverage run --source=pybal setup.py test [15:14:21] <_joe_> mark: are you sure coverage doesn't run with python 3?
[15:14:32] <_joe_> because that can be the reason for that error [15:14:33] nope, 2.7.13 [15:14:42] <_joe_> ok, then it's definitely strange [15:15:29] indeed [15:15:52] * mark is slowly but surely writing test cases for bgp [15:16:01] not the best ones, but at least getting most code some coverage helps [15:16:07] <_joe_> yes [15:16:10] there are still a few typos in there, in code that has obviously never been run ;) [15:16:32] (03PS1) 10BBlack: interface-rps: update documentation [puppet] - 10https://gerrit.wikimedia.org/r/355787 [15:16:35] (03PS1) 10BBlack: interface-rps: clean up typing/format-string issues [puppet] - 10https://gerrit.wikimedia.org/r/355788 [15:16:36] (03PS1) 10BBlack: interface-rps: refactor opts handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/355789 [15:16:38] (03PS1) 10BBlack: interface-rps: optional NUMA awareness [puppet] - 10https://gerrit.wikimedia.org/r/355790 [15:17:41] (03CR) 10jerkins-bot: [V: 04-1] interface-rps: optional NUMA awareness [puppet] - 10https://gerrit.wikimedia.org/r/355790 (owner: 10BBlack) [15:18:45] it looks pretty to me :P [15:18:55] (03PS1) 10Cmjohnson: Adding mgmt dns entries for labvirt101[5-8] T165531 [dns] - 10https://gerrit.wikimedia.org/r/355792 [15:18:57] <_joe_> mark: did you remove .pyc files before running coverage? [15:23:57] _joe_: i did now, doesn't help [15:24:02] let me see if it's my changes... [15:24:16] <_joe_> mark: it works on my computer, and on wmf's CI [15:24:25] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for labvirt101[5-8] T165531 [dns] - 10https://gerrit.wikimedia.org/r/355792 (owner: 10Cmjohnson) [15:24:34] weird [15:24:41] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:24:45] alright, will fiddle, thanks [15:24:48] <_joe_> so I guess we can blame mac os [15:26:09] heresy!
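The `ImportError: cannot import name util` above only reproduced locally, and the .pyc suspicion turns out later in the log to be correct: a stale bytecode file survived the module split, and Python 2 will happily import a leftover `.pyc` whose `.py` source is gone. A minimal sketch (assuming the side-by-side `foo.pyc` layout CPython 2 uses, which is what pybal ran on; the helper name is mine) that lists such orphaned bytecode files so they can be deleted:

```python
import os


def stale_pyc_files(root):
    """Return paths of .pyc files whose .py source no longer exists.

    These typically linger after a module is deleted or renamed in a
    refactor, and can shadow the new layout with confusing
    ImportErrors. Only covers the Python 2 side-by-side layout, not
    Python 3's __pycache__ directories.
    """
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pyc"):
                source = os.path.join(dirpath, name[:-1])  # foo.pyc -> foo.py
                if not os.path.exists(source):
                    stale.append(os.path.join(dirpath, name))
    return stale
```

Running something like this over the working tree after a refactor and deleting what it reports (or exporting `PYTHONDONTWRITEBYTECODE=1`, as suggested later in the log) avoids importing modules that no longer exist in the source.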
[15:27:25] (03PS2) 10BBlack: interface-rps: optional NUMA awareness [puppet] - 10https://gerrit.wikimedia.org/r/355790 [15:27:58] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3294663 (10thcipriani) To recap where this issue left off yesterday: Attempt to recreate... [15:28:08] <_joe_> mark: also, try to run PYTHONPATH="" coverage run --source pybal `which trial` pybal which is what CI is doing [15:29:10] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#2696835 (10Peter) That said, the navigation timing metrics from the real world doesn't give us the best measurements for user experience so I thi... [15:32:23] 06Operations: Error while enabling symlinked units on stretch systemd - https://phabricator.wikimedia.org/T166389#3294673 (10fgiunchedi) [15:36:55] _joe_: no doesn't work due to my pyenv environment [15:36:57] meh [15:37:06] * mark doesn't feel like debugging this shit now, it's hot :P [15:40:06] _joe_: `export PYTHONDONTWRITEBYTECODE=1` ;-) [15:40:42] (03PS1) 10Cmjohnson: Adding mgmt dns entries for kubestage1001/2 T166264 [dns] - 10https://gerrit.wikimedia.org/r/355794 [15:41:25] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for kubestage1001/2 T166264 [dns] - 10https://gerrit.wikimedia.org/r/355794 (owner: 10Cmjohnson) [15:43:09] (03PS3) 10Mark Bergsma: Add some protocol BGP class test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355415 [15:43:11] (03PS2) 10Mark Bergsma: Add bgp.ip unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355425 [15:43:13] (03PS2) 10Mark Bergsma: Add basic unit tests for protocol BGP send methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/355445 [15:43:15] (03PS1) 10Mark Bergsma: Add BGP.parseOpen unit test cases 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/355795 [15:48:12] <_joe_> mark: maybe one of the problems you have is that you should run [15:48:15] <_joe_> PYTHONPATH="" coverage run --source pybal `which trial` pybal [15:48:29] <_joe_> instead of setup.py, which uses pyunit instead of trial [15:48:50] trial gives another error ;) [15:49:16] File "/Users/mark/Documents/workspace/pybal/pybal/test/test_ipvs.py", line 9, in [15:49:16] import pybal.ipvs [15:49:16] File "/Users/mark/Documents/workspace/pybal/pybal/ipvs.py", line 8, in [15:49:16] from pybal.bgpfailover import BGPFailover [15:49:16] File "/Users/mark/Documents/workspace/pybal/pybal/pybal.py", line 16, in [15:49:17] [15:49:19] exceptions.ImportError: cannot import name ipvs [15:49:33] <_joe_> pybal/pybal/pybal.py ? [15:49:38] <_joe_> that file is gone in master [15:49:42] indeed [15:50:15] wtf [15:51:02] gah [15:51:12] yeah it was a .pyc [15:51:42] ok, sorry for wasting your time :P [15:51:55] <_joe_> btw, I just decided I'll make the pep8 fixes on 2.0-dev for sure [15:52:08] yes [15:52:33] <_joe_> I'll rebase on top of the split I just did [15:52:55] <_joe_> and resubmit the fsm patch I guess on top of that [15:53:27] <_joe_> and then add the pep8 fixes I think are needed [15:55:10] TOTAL 3573 1183 67% [15:55:13] we're getting somewhere [15:55:26] <_joe_> :) [15:55:29] master is still at ~61% [15:55:44] <_joe_> a +6% is a good improvement [15:56:37] (03PS2) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/355082 [15:58:34] (03PS1) 10Ottomata: [WIP] Puppetize TLS encryption and auth for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/355796 (https://phabricator.wikimedia.org/T166162) [16:07:37] 06Operations, 10ops-codfw: mw2140.codfw.wmnet unreponsive, cannot be powercycled with serial console - https://phabricator.wikimedia.org/T166328#3294848 (10Papaul) a:03Papaul [16:12:37] (03PS3) 10Elukey: [WIP] Add the eventlogging_cleaner 
script and base package [software/analytics-eventlogging-maintenance] - 10https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) [16:25:49] (03PS7) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) [16:25:51] (03PS4) 10Andrew Bogott: Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - 10https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097) [16:25:55] (03PS4) 10Andrew Bogott: Horizon sudo panel: Add policy checks [puppet] - 10https://gerrit.wikimedia.org/r/355459 [16:31:40] 06Operations, 10Ops-Access-Requests, 10Analytics, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3294928 (10elukey) [16:36:15] (03Abandoned) 10GWicke: Update access log sampling to match new hyperswitch levels [puppet] - 10https://gerrit.wikimedia.org/r/342251 (owner: 10GWicke) [16:46:21] RECOVERY - Host mw2140 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:47:06] 06Operations, 10ops-codfw: mw2140.codfw.wmnet unreponsive, cannot be powercycled with serial console - https://phabricator.wikimedia.org/T166328#3295003 (10Papaul) a:05Papaul>03jcrespo Removed power for 5 minutes, Update IDRAC firmware from 1.57 to 2.32, update BIOS from 2.3 to 2.4. System is back up. [16:59:52] 06Operations, 10Deployment-Systems, 06Performance-Team, 06Release-Engineering-Team, 07HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#3295034 (10Krinkle) p:05Triage>03High [17:01:31] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:03:36] 06Operations, 10Ops-Access-Requests, 10Analytics, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295046 (10RobH) a:05Nuria>03None This seems pretty straightforward, and I'll be on clinic next week, so just commenting the checklist: [x] - all users listed @P... [17:07:06] 06Operations, 10DBA, 06Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3295071 (10aaron) p:05Triage>03Normal [17:13:22] (03PS1) 10RobH: adding gwicke and ppchelko to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/355804 [17:17:41] PROBLEM - mediawiki-installation DSH group on mw2140 is CRITICAL: Host mw2140 is not in mediawiki-installation dsh group [17:23:39] (03PS1) 10Dzahn: planet: add Wikimedia Cloud Services blog feed [puppet] - 10https://gerrit.wikimedia.org/r/355806 [17:30:31] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:37:23] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3295180 (10greg) Status update from RelEng: We don't have any plans to roll out wmf.2 any... [17:39:09] (03CR) 10Greg Grossmeier: [C: 031] "Happy to see this used more widely." 
[puppet] - 10https://gerrit.wikimedia.org/r/355806 (owner: 10Dzahn) [17:47:01] (03PS1) 10BBlack: facter: add NUMA information [puppet] - 10https://gerrit.wikimedia.org/r/355809 [17:47:03] (03PS1) 10BBlack: NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 [17:47:05] (03PS1) 10BBlack: [placeholder] nginx NUMA-networking awareness [puppet] - 10https://gerrit.wikimedia.org/r/355811 [17:47:58] (03PS2) 10BBlack: interface-rps: update documentation [puppet] - 10https://gerrit.wikimedia.org/r/355787 [17:47:59] (03PS2) 10BBlack: interface-rps: clean up typing/format-string issues [puppet] - 10https://gerrit.wikimedia.org/r/355788 [17:48:01] (03PS2) 10BBlack: interface-rps: refactor opts handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/355789 [17:48:03] (03PS3) 10BBlack: interface-rps: optional NUMA awareness [puppet] - 10https://gerrit.wikimedia.org/r/355790 [17:48:06] (03PS2) 10BBlack: facter: add NUMA information [puppet] - 10https://gerrit.wikimedia.org/r/355809 [17:48:08] (03PS2) 10BBlack: NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 [17:48:10] (03PS2) 10BBlack: [placeholder] nginx NUMA-networking awareness [puppet] - 10https://gerrit.wikimedia.org/r/355811 [17:49:41] (03CR) 10jerkins-bot: [V: 04-1] facter: add NUMA information [puppet] - 10https://gerrit.wikimedia.org/r/355809 (owner: 10BBlack) [17:50:22] (03CR) 10jerkins-bot: [V: 04-1] [placeholder] nginx NUMA-networking awareness [puppet] - 10https://gerrit.wikimedia.org/r/355811 (owner: 10BBlack) [17:50:24] (03CR) 10jerkins-bot: [V: 04-1] NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 (owner: 10BBlack) [17:50:56] yeah yeah :P [17:50:57] (03PS3) 10BBlack: facter: add NUMA information [puppet] - 10https://gerrit.wikimedia.org/r/355809 [17:50:59] (03PS3) 10BBlack: NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 [17:51:01] (03PS3) 10BBlack: [placeholder] nginx NUMA-networking 
awareness [puppet] - 10https://gerrit.wikimedia.org/r/355811 [17:51:24] someone needs to invent me a scripting language that accepts random mixes of python, ruby, perl, and C syntax :P [17:52:02] (03CR) 10jerkins-bot: [V: 04-1] facter: add NUMA information [puppet] - 10https://gerrit.wikimedia.org/r/355809 (owner: 10BBlack) [17:52:18] (03CR) 10jerkins-bot: [V: 04-1] NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 (owner: 10BBlack) [17:52:44] (03CR) 10jerkins-bot: [V: 04-1] [placeholder] nginx NUMA-networking awareness [puppet] - 10https://gerrit.wikimedia.org/r/355811 (owner: 10BBlack) [17:52:50] (03PS2) 10Dzahn: planet: add Wikimedia Cloud Services blog feed [puppet] - 10https://gerrit.wikimedia.org/r/355806 [17:53:37] dwim [17:59:04] 06Operations, 10Ops-Access-Requests, 10Analytics, 13Patch-For-Review, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295298 (10Nuria) Approved on my end. I actually thought all three had access already. [17:59:16] (03PS8) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) [17:59:18] (03PS5) 10Andrew Bogott: Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - 10https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097) [17:59:20] (03PS5) 10Andrew Bogott: Horizon sudo panel: Add policy checks [puppet] - 10https://gerrit.wikimedia.org/r/355459 [18:15:22] PROBLEM - configured eth on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:16:11] RECOVERY - configured eth on ms-be1020 is OK: OK - interfaces up [18:17:21] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused [18:17:38] (03PS4) 10BBlack: facter: add NUMA information [puppet] - 10https://gerrit.wikimedia.org/r/355809 [18:17:40] (03PS4) 10BBlack: NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 [18:17:42] (03PS4) 10BBlack: [placeholder] nginx NUMA-networking awareness [puppet] - 10https://gerrit.wikimedia.org/r/355811 [18:19:40] 06Operations, 10Ops-Access-Requests, 10Analytics, 13Patch-For-Review, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295382 (10RobH) a:03GWicke @Gwicke: Can you please review and sign the L3 document? Once done, please feel free to assign to me for followup... [18:21:19] (03CR) 10jerkins-bot: [V: 04-1] NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 (owner: 10BBlack) [18:21:58] (03CR) 10Dzahn: [C: 032] planet: add Wikimedia Cloud Services blog feed [puppet] - 10https://gerrit.wikimedia.org/r/355806 (owner: 10Dzahn) [18:22:25] 06Operations, 10Ops-Access-Requests, 10Analytics, 13Patch-For-Review, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295392 (10GWicke) a:05GWicke>03RobH @RobH, signed it earlier. Thanks for the quick follow-up! [18:23:13] (03PS5) 10BBlack: NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 [18:23:15] (03PS5) 10BBlack: [placeholder] nginx NUMA-networking awareness [puppet] - 10https://gerrit.wikimedia.org/r/355811 [18:27:47] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant sudo access for Bryan Davis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3292124 (10RobH) I'll be on clinic duty next week, so I am just jumping in here to clarify what is pending on this: * This request is sudo related,... 
[18:28:40] 06Operations, 10Ops-Access-Requests, 10Analytics, 13Patch-For-Review, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295411 (10RobH) Thanks! Pending no objections (and I don't think there will be any), I'll merge this live on Wednesday! [18:36:21] PROBLEM - configured eth on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:37:07] (03PS4) 10BBlack: interface-rps: optional NUMA awareness [puppet] - 10https://gerrit.wikimedia.org/r/355790 [18:37:09] (03PS5) 10BBlack: facter: add NUMA information [puppet] - 10https://gerrit.wikimedia.org/r/355809 [18:37:11] (03PS6) 10BBlack: NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 [18:37:12] RECOVERY - configured eth on ms-be1020 is OK: OK - interfaces up [18:37:13] (03PS6) 10BBlack: [placeholder] nginx NUMA-networking awareness [puppet] - 10https://gerrit.wikimedia.org/r/355811 [18:39:53] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant sudo access for Bryan Davis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3295443 (10bd808) This won't give me +2 in ops/puppet.git so "applying puppet patches" probably isn't actually possible. I would be able to force pu... [19:01:02] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant sudo access for Bryan Davis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3295484 (10RobH) So, I was chatting with Roan about the access he had, when he had root. His root access back then did not immediately require that... 
[19:03:48] (03PS2) 10Dzahn: wikistats: install php7.0-xml if on stretch [puppet] - 10https://gerrit.wikimedia.org/r/355742 (https://phabricator.wikimedia.org/T165879) [19:04:39] (03CR) 10Dzahn: [C: 032] "on stretch installing php7.0-xml is needed to get PHP Class DOMDocument, on jessie it seems to be there without an additional package to t" [puppet] - 10https://gerrit.wikimedia.org/r/355742 (https://phabricator.wikimedia.org/T165879) (owner: 10Dzahn) [19:07:07] (03PS3) 10Dzahn: wikistats: install php7.0-xml if on stretch [puppet] - 10https://gerrit.wikimedia.org/r/355742 (https://phabricator.wikimedia.org/T165879) [19:07:21] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [19:10:54] (03PS9) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) [19:10:56] (03PS6) 10Andrew Bogott: Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - 10https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097) [19:10:58] (03PS6) 10Andrew Bogott: Horizon sudo panel: Add policy checks [puppet] - 10https://gerrit.wikimedia.org/r/355459 [20:02:31] (03CR) 10Andrew Bogott: [C: 031] [wikitech] Increase weight on Tool and Nova Resource ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354474 (https://phabricator.wikimedia.org/T165725) (owner: 10DCausse) [20:02:36] (03PS2) 10Andrew Bogott: [wikitech] Increase weight on Tool and Nova Resource ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354474 (https://phabricator.wikimedia.org/T165725) (owner: 10DCausse) [21:02:56] (03PS2) 10Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/354074 [21:11:15] (03CR) 10Dzahn: "@Alex heh, yea, duuh.. i just needed @ipv4 and @ipv6 they are already class parameters." 
[puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [21:29:48] (03CR) 10Dzahn: "i would have tried to use a fact to replace the Hiera setting of IPv4 and IPv6 altogether. since that is the second IP on eth0 it is not d" [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [21:37:58] (03PS3) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [21:47:24] (03CR) 1020after4: "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/355769 (https://phabricator.wikimedia.org/T166120) (owner: 10Filippo Giunchedi) [21:47:48] 06Operations, 10Phabricator, 10Traffic, 13Patch-For-Review, 15User-fgiunchedi: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3295761 (10mmodell) p:05Triage>03Normal [21:50:59] (03CR) 10Krinkle: [wikitech] Increase weight on Tool and Nova Resource ns (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354474 (https://phabricator.wikimedia.org/T165725) (owner: 10DCausse) [22:01:58] (03PS4) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [22:14:51] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/6548/" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [22:15:23] (03CR) 10Dzahn: "same here, all it was is use the existing @ipv4 and @ipv6 in the template. 
should have never put the profile in it" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [22:25:21] PROBLEM - Apache HTTP on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.335 second response time [22:26:22] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.117 second response time [22:28:02] (03CR) 10Krinkle: "Seems worth testing this in Vagrant or Beta with LocalSetting matching and non-matching and look for a difference in observed latency or r" [puppet] - 10https://gerrit.wikimedia.org/r/353228 (https://phabricator.wikimedia.org/T107128) (owner: 10Tim Starling) [22:45:05] Hey, were there any changes to DNS recently? I'm investigating corp.wikimedia.org [22:45:42] I'm out of the office, but believe the DNS service is still working because people in the office still have internet [22:55:37] byron: the last ones were hours ago and look like just labvirt-kubernetes stuff. what's broken [22:58:26] byron: it seems ns1.corp is failing [22:58:36] status: SERVFAIL [23:01:01] looking up corp.wm.org on ns0.wm.org works as normal and says how ns1/ns2.corp are the authority, but those fail then [23:03:12] (03PS1) 10Dzahn: LVS/phabricator: add git-ssh in codfw [puppet] - 10https://gerrit.wikimedia.org/r/355869 [23:06:43] (03PS2) 10Dzahn: LVS/phabricator: add git-ssh in codfw [puppet] - 10https://gerrit.wikimedia.org/r/355869 (https://phabricator.wikimedia.org/T164810) [23:12:42] (03PS1) 10Dzahn: phabricator: reactivate exim ganglia stats [puppet] - 10https://gerrit.wikimedia.org/r/355870 [23:14:57] (03PS2) 10Dzahn: phabricator: reactivate exim ganglia stats [puppet] - 10https://gerrit.wikimedia.org/r/355870 [23:15:22] (03CR) 10Dzahn: [C: 032] "on the labs instance: root: root@wmflabs.org" [puppet] - 10https://gerrit.wikimedia.org/r/355870 (owner: 10Dzahn) [23:22:44] (03PS1) 10Dzahn: phabricator: move hiera lookups to parameters [puppet] - 
10https://gerrit.wikimedia.org/r/355871 [23:33:23] mutante: must be an issue on the dns server then [23:40:13] 06Operations, 06Performance-Team, 06Services, 07Availability (Multiple-active-datacenters): Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#3295832 (10aaron) p:05Normal>03Low [23:43:38] mutante: OK, looks like it's fixed [23:47:40] byron: :) [23:47:59] Thanks for responding though!!! [23:48:01] yes, confirmed [23:48:08] it looks working again [23:48:10] yw
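The corp.wikimedia.org outage above followed a classic pattern: the parent zone's server (ns0.wikimedia.org) answered normally and handed out the delegation, but the delegated nameservers (ns1/ns2.corp) themselves returned SERVFAIL. As a small illustrative helper (the function and its inputs are hypothetical, not a real tool used in this incident), the triage logic can be encoded over per-server response codes:

```python
def diagnose_delegation(parent_rcode, child_rcodes):
    """Classify a delegation problem from DNS response codes.

    parent_rcode: rcode string from querying the parent zone's server
                  (here, ns0.wikimedia.org for corp.wikimedia.org).
    child_rcodes: mapping of delegated nameserver -> rcode string
                  (here, ns1/ns2.corp, which were returning SERVFAIL).
    """
    if parent_rcode != "NOERROR":
        # The parent itself can't answer, so delegation can't be checked.
        return "parent zone broken"
    failing = sorted(ns for ns, rc in child_rcodes.items() if rc != "NOERROR")
    if not failing:
        return "delegation healthy"
    # Delegation is intact but the listed authoritative servers are down.
    return "authoritative servers failing: " + ", ".join(failing)
```

The raw inputs come from queries like `dig @ns0.wikimedia.org corp.wikimedia.org NS` for the parent side, then querying each returned nameserver directly, which is essentially what was done by hand above.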